Have weak data. But need to make decision. What to do?

Vlad Malik writes:

I just re-read your article “Of Beauty, Sex and Power”.

In my line of work (online analytics), low power is a recurring, existential problem. Do we act on this data or not? If not, why are we even in this business? That’s our daily struggle.

Low power seems to create a sort of paradox: either some evidence is better than none, or the evidence is so weak as to be useless. I'm not sure which it is, and your article hints at both sides of that conflict.

If you are studying small populations, for example, it might not be possible to collect enough data for a “good sample”. You could collect some. But is such a study worth doing? And is the outcome of such a study worth even a guess? Is too little data as good as no data at all?

You do suggest in your article that “if we had to guess” then the low-powered study might still provide guidance. Yet you also say “we have essentially learned nothing from this study” and later point to a high probability that the effect from such a study may actually be pointing in the wrong direction.

In your critique, you rely heavily on lit review. What if past information is not available or is not as directly relevant, so the expected effect size is vague or unknown (at least to the experimenter’s best knowledge)? In that case, it might not be obvious that the effect is inflated.

Can data collected under such conditions ever be actionable? And if such data is published, how does one preface it? “Use with caution” or “Ignore for now. More research needed”?

What if not acting carries an opportunity cost and we have to act? Do we use the data, or ignore it and rely on other criteria? If we say "this weak data supports other indirect evidence," we might be acting with confirmation bias if the data is not in fact reliable at all. What if the weak data contradicts other evidence? How much power is enough for a study to be worth even considering?

My reply: Indeed, even if evidence from the data at hand is not convincing, that doesn’t mean we have to do nothing. I strongly oppose using a statistical significance threshold to make decisions. In general I have three recommendations:

1. Use prior information. For example, in that silly beauty-and-sex-ratio study, we have lots of prior information that any differences would have to be tiny. We know ahead of time that our prior information is much stronger than the data.

2. When it comes to decision making, much depends on costs and benefits. We have a chapter on decision analysis in Bayesian Data Analysis that illustrates with three examples.

3. As much as possible, use quantitative methods to combine different sources of information:
– Define the parameter or parameters you want to estimate.
– Frame your different sources of information as different estimates of this parameter, ideally with statistically independent errors.
– Get some assessment of the bias and variance of each of your separate estimates.
– Then adjust for bias and get a variance-weighted estimate.
The above is the simplest scenario, where all the estimates are estimating the same parameter. In real life I suppose you’re more likely to be in an “8 schools” or meta-analysis situation, where your different estimates are estimating different things. And then you’ll want to fit a hierarchical model. Which is actually not so difficult. I suppose someone (maybe me) should write an article on how to do this, for a non-technical audience.
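
Here is a minimal numerical sketch of that simplest scenario, in Python, with made-up numbers. One useful detail: a strong prior (recommendation 1) can enter the combination as just another low-variance "estimate," which is exactly how the prior ends up dominating weak data:

```python
import numpy as np

# Minimal sketch: combine several estimates of one parameter, each with
# an assessed bias and variance. All numbers here are hypothetical.
# Source 0 plays the role of a strong prior guess: its tiny variance
# means it dominates the two noisy data estimates.

estimates = np.array([0.00, 0.08, 0.05])          # one estimate per source
biases    = np.array([0.00, 0.02, -0.01])         # assessed bias per source
variances = np.array([0.003**2, 0.03**2, 0.05**2])

corrected = estimates - biases                    # adjust for bias first...
weights   = 1.0 / variances                       # ...then weight by precision
combined  = np.sum(weights * corrected) / np.sum(weights)
combined_se = np.sum(weights) ** -0.5

print(f"combined estimate: {combined:.4f} (se {combined_se:.4f})")
# The result sits almost on top of the strong prior; the noisy
# estimates barely move it.
```

In the 8-schools situation you would instead let each source estimate its own parameter, with those parameters drawn from a common distribution; fitting that hierarchical model partially pools the bias-corrected estimates toward their common mean rather than collapsing everything into a single number.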

Anyway, the point is, when you do things this way, concepts of “power” aren’t so directly relevant. You just combine your information as best you can.

16 thoughts on "Have weak data. But need to make decision. What to do?"

  1. There is also an important point to be made about how well we understand the structural relationships in the underlying system we are deciding about. If we are unsure about a parameter (or a small set of them) but have a significant knowledge base about the domain, we can perform a lot of exact sensitivity analysis. This is typical in physics or in financial decision making. If there are too many parameters for this type of simple analysis, and there is a significant decision to be made, I would argue for a more robust approach, such as minimizing regret (sketched at the end of this comment). A set of such techniques has been used (again, in domains where the causal structure is understood) for policy decision making, under the names "Robust Decision Making" (http://www.rand.org/pubs/research_briefs/RB9701.html) and Exploratory Analysis (http://simulation.tbm.tudelft.nl/ema-workbench/contents.html)

    On the other hand, if the domain is like online analytics, as in the original question, there are a lot of places where the structural understanding of the underlying problem is weak – we simply don't know enough about why people make the specific decisions we are interested in, or what influence some arbitrary factor will have, and our best models have limited accuracy. (We may also be doing the analysis in an automated fashion, so that there is no obvious way to apply domain-specific knowledge about the phenomena being investigated.) I know Google has worked on modifying its online learning algorithms that use online logistic regression to predict ad clicks from an insanely high-dimensional feature set, and they definitely think about, and worry about, overfitting on rare features. I am unsure what specific approaches have been explored in these domains, but my impression is that they tend to be model-free big-data approaches that require large datasets – which are difficult to get for the types of small-population questions you are asking.
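
    As a concrete illustration of the regret-minimization idea mentioned above, here is a hedged Python sketch; the actions, scenarios, and payoff numbers are entirely hypothetical:

    ```python
    import numpy as np

    # Minimax regret: useful when we cannot assign reliable probabilities
    # to scenarios. Rows are candidate actions, columns are scenarios;
    # the payoff numbers are made up for illustration.

    payoff = np.array([
        [10.0, 2.0, 4.0],   # action A
        [ 6.0, 5.0, 5.0],   # action B
        [ 8.0, 1.0, 7.0],   # action C
    ])

    # Regret = shortfall relative to the best action in each scenario.
    regret = payoff.max(axis=0) - payoff
    worst_regret = regret.max(axis=1)       # each action's worst case
    best_action = int(worst_regret.argmin())

    print("worst-case regret per action:", worst_regret)
    print("minimax-regret choice: action", "ABC"[best_action])
    ```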

  2. Strictly speaking, "power" is all about whether you can accept or reject a hypothesis: the higher the power, the lower the probability of a "false negative" in the hypothesis-testing procedure.

    Well, in a Bayesian analysis instead of using a threshold to conclude that there is or isn’t an effect, you get a whole spectrum of possible effect sizes with their associated probabilities. What you do with that spectrum is up to you, and in Bayesian decision theory, it’s specifically up to how much you value or dislike the outcomes you expect to have if the effect size is any given value in the spectrum.

    In the end, you need to boil things down to a binary "yes or no" or maybe a categorical "choose this or that or the other" action, but the boiling-down process in a Bayesian analysis uses information about tradeoffs in the outcomes that hypothesis testing doesn't (a sketch follows at the end of this comment).

    Instead of "power," I think the right terminology for a Bayesian is "informativeness": a study that is not informative will result in a very broad posterior distribution, whereas a highly informative study will result in a sharply peaked posterior distribution.

    You may not be doing very informative experiments, but they do inform you somewhat, and making your decision using decision theory AFTER the experiment should be better, or at least NO WORSE, than making the decision based entirely on the prior.

    The only concern I’d have is if you are somehow doing very highly biased experiments, so that the information you’re getting is about some sub-population which is different from the one which your decision targets. Assessing that can be very hard in the “online analytics” world.
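
    Here is the sketch promised above – a hedged Python illustration of the boiling-down step, with a purely illustrative posterior and payoffs. Posterior draws of the effect size, plus a value for each action, give expected values, and you pick the action with the highest one:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative posterior draws of the effect size -- a stand-in for
    # the output of a real Bayesian fit.
    effect = rng.normal(loc=0.02, scale=0.03, size=10_000)

    # Hypothetical payoffs: shipping the change gains in proportion to
    # the true effect but carries a fixed cost; holding gains nothing.
    value_ship = 100 * effect - 1.0
    value_hold = np.zeros_like(effect)

    expected = {"ship": value_ship.mean(), "hold": value_hold.mean()}
    best = max(expected, key=expected.get)
    print(expected, "-> choose", best)
    ```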

    • You don't need to pick a binary or categorical action; as long as you have a value function whose inputs are included in your model, and either risk-neutrality or an explicit expression for risk valuation, you can maximize it (sketched below).

      And for highly biased experiments, as long as you modeled the difference in populations correctly, it's still workable – but this is why A/B testing is so popular, since you know the samples are from the same population. (Well, the same other than the time at which they act.)
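
      As a sketch of the first point, assuming risk-neutrality: with a value function over a continuous action, you can maximize expected value directly rather than choosing from a menu. The payoff function and numbers here are entirely hypothetical.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      theta = rng.normal(0.02, 0.03, size=10_000)   # posterior draws

      def expected_value(x):
          # Hypothetical payoff: spending x amplifies the effect but
          # costs x**2, so there is an interior optimum.
          return np.mean(100 * x * theta - x**2)

      xs = np.linspace(0.0, 5.0, 501)               # grid over the action
      best_x = xs[np.argmax([expected_value(x) for x in xs])]
      print(f"optimal spend ~ {best_x:.2f}")        # ~1.0 with these numbers
      ```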

      • I would be more concerned that the circumstances and population change between the time the study was done and the time when the model is used to predict outcomes. For example, the Netflix Prize algorithm was never used. Why? Because by the time the contest was over, everyone was streaming movies rather than getting mailed DVDs, and it turned out people want different things when selecting content one way vs. the other. There may also be differences between early adopters of a service and later ones. If your users change dramatically, the model will perform much worse than expected.

        People think they are estimating predictive skill; what they are really getting is an upper bound for that model. This is better than nothing, but it can be highly misleading if you aren't careful.

  3. In my field (clinical trials) there is an obsession with power, even when there is no binary yes/no decision to be made. People just don’t seem to get beyond lazy binary thinking – it works or it doesn’t. A lot of the time the outcome of a trial isn’t a yes/no decision but information that is an input to some bigger decision-making process. So I tell people they shouldn’t try to make a decision unless they have to, and they look at me strangely.

  4. It's strange and alarming to read how, outside the financial industry – but in areas where risk and uncertainty are still central – risk management is still such an alien concept. The input is uncertain, but often so is the consequent course of action, and therefore it makes little sense to introduce a crispness bottleneck between the two wave functions. Nowadays we can go fully Bayesian by switching from hard decisions to strategy optimization embedded in the input uncertainty frame.

    (disclaimer: I'm working on such a tool complementing Stan)
