Noise noise noise noise noise

An interesting issue came up in comments to yesterday’s post.

The story began with this query from David Shor:

Suppose you’re conducting an experiment on the effectiveness of a pain medication, but in the post survey you measure a large number of indicators of well-being (sleep quality, self-reported pain, ability to get tasks done, anxiety levels, etc.).

After the experiment, the results are insignificant (or the posterior effect size isn’t large or whatever) on 9 of the 10 measures, but significant on the tenth. . . .

What is the “right” thing to do in that situation?

I recommended fitting a multilevel model. But various commenters pointed out other interesting options regarding what to do next.

In particular, BenK gave a thoughtful but, I believe, misguided suggestion, and I think it’s worth exploring what went wrong in his reasoning. Here’s what he wrote:

For a typical clinical trial, pre-registration is ideal. But regardless, seems to me that the best thing to do is . . . another study, focused on that 10th issue, and the people who suffer from it most. Then you can label for the relief which is offered. In short, the study was hypothesis generating, not conclusive.

I replied:

Sure, it’s fine to do a new study. But . . . (1) before doing that new study, it would be good to have an estimate of the effect size, (2) such an estimate would be useful in designing a new study, and (3) if you do want to do follow-up research, it’s not at all clear you should do exactly one study, and even if you do just one follow-up study, it’s not clear that you should do it on that 10th measure, just cos it happens to have been statistically significant in that one comparison.

The problem—and it is common in statistical analysis—is to take some noisy measure such as a p-value and treat it as truth. Thus, the 9 non-significant results become zeroes and the one statistically significant result becomes a finding—or, in BenK’s more sophisticated version, it becomes a possible finding that needs to be checked. I prefer BenK’s version to that standard version (which one might call the “ESP” or “power pose” or “himmicanes” view of statistics), but it still has big problems.

The p-value is a random variable. It’s noisy! Small meaningless differences in the data can map to apparently huge differences in p-values (going from 0.10 to 0.01, for example).
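
To give a sense of just how noisy, here is a minimal simulation sketch; the true effect, the group size (picked to give roughly 50% power), and the number of replications are all just illustrative numbers:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_effect, sd, n = 0.5, 1.0, 31      # illustrative values; roughly 50% power

    pvals = []
    for _ in range(20):                    # 20 replications of the very same experiment
        treated = rng.normal(true_effect, sd, n)
        control = rng.normal(0.0, sd, n)
        pvals.append(stats.ttest_ind(treated, control).pvalue)

    print(np.round(np.sort(pvals), 3))     # same true effect every time, yet the p-values
                                           # typically range from below 0.01 to above 0.10

Nothing about the underlying effect changes from one replication to the next; only the noise does, and the p-values bounce all over the place.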

If you want to decide what to do next, or where to do your next experiment, map out your costs and benefits. It’s a decision problem. P-values (or, for that matter, Bayes factors) are a trap, and if you use them as the basis for what to study next, you’re asking for trouble.

36 thoughts on “Noise noise noise noise noise”

  1. When designing the original (hypothetical) study in BenK’s example, could the researchers have done anything differently that would reduce ambiguities in the results and simplify the design of any follow-up studies? In other words, was the original study reasonably designed, or could it have been improved?

    • They could have pre-registered. They could have thought out beforehand how they were going to aggregate 10 disparate measurements into one criterion.

      They could have deliberated as to how these measures actually translate into the sort of “effectiveness” a doctor or patient is looking for.

      What caused them to measure these particular 10 things in the first place?

  2. Fully agree, and recall the predicament of many of the clinical researchers who followed this – we were funded to study this and found it non-significant but noticed this other thing was significant, so we got funding to study that, etc. On the third failure they were in serious career trouble.

    On the other hand, p-values could be taken as an initial (crude) default decision analysis recalling http://statmodeling.stat.columbia.edu/2014/04/29/ken-rice-presents-unifying-approach-statistical-inference-hypothesis-testing/#comment-160377 (third sentence).

  3. I don’t see BenK treating the p value as truth. In this case, he actually appears to be using the p value for what it was designed for – deeper investigation of something that looked interesting. Without using this piece of data, I’m a little lost with figuring out what my next step would be.

    Am I missing something deeper here?

    • p values are good for detecting differences from existing, real, highly informed models (i.e. the Standard Model of particle physics, or a model for clinical treatments calibrated based on a couple thousand cases, or an operating characteristic of a manufacturing plant, or something). Detecting differences from a “null” model doesn’t really mean much by itself, especially when you’re detecting at marginal levels (p = 0.03, for example) using noisy measurement with a poor/naive model for the measurement error.

      It makes more sense to aggregate information across the 10 outcomes that were measured to try to get a less noisy measurement, and then determine if you can find anything of interest in this aggregated measurement. At that point, if you find a small effect it might make sense to define a new study.

      Basically, squeeze as much useful information out of what you already paid for, and then use that information to inform the next study if one is necessary. The smallish p value isn’t sufficient reason to pour resources into another project without evaluating the decision to use those resources properly.
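
      For concreteness, here is a minimal sketch of one way to do that aggregation: standardize each of the 10 outcomes, average them into a single equal-weight composite per patient, and compare the composite between arms. The simulated data, the equal weights, and all the names here are just placeholders for illustration, not a recommendation for any particular trial:

        import numpy as np

        rng = np.random.default_rng(0)
        n, k = 100, 10                             # 100 patients, 10 well-being measures (made up)
        treated = rng.integers(0, 2, n)            # fake 0/1 assignment, standing in for real data
        outcomes = rng.normal(size=(n, k))         # fake outcome matrix, one column per measure

        # put every measure on a common scale (in practice, first flip signs so that
        # "higher" means "better" for every column)
        z = (outcomes - outcomes.mean(axis=0)) / outcomes.std(axis=0)
        composite = z.mean(axis=1)                 # equal-weight composite score per patient

        a, b = composite[treated == 1], composite[treated == 0]
        est = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        print(f"composite effect: {est:.2f} (se {se:.2f})")

      A weighted composite informed by what “effectiveness” actually means to patients would be better still, but even this crude version uses all ten measurements instead of throwing nine of them away.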

        • I’d be wary of invoking particle physics as a defense. The reason that the thresholds for collaborations to declare a true detection are so extreme is the many times in the past (and sadly the present) where people went to press with what were found to be noise fluctuations. The 750 GeV diphoton results seem to indicate that particle theorists have some of the same bad habits as their colleagues in the social sciences.

        • It’s not so much that particle physics is a defense, it’s more that this is a legitimate use of p values, to find out where the data disagrees probabilistically with a well calibrated model. It doesn’t mean that physicists are always interpreting the p values correctly though, as you point out.

        • As someone with a PhD in particle physics, I disagree. None of the models are well-calibrated — not only are there giant theoretical uncertainties, there are also giant measurement uncertainties. In the end it’s the same problem as anywhere else in science: in practice you can’t compute p-values and powers corresponding to realistic likelihoods, so you have to resort to all kinds of approximations which lead to troubling results. The constant drift upwards of the detection threshold is absolutely a compensation for models becoming more complicated (or the signal to background decreasing) and the power rapidly decreasing. You can even see it in the number of false detection papers.

        • Ok, well, *if* you had a decently calibrated model, finding data that was inconsistent with it could be done by looking at p values :-)

          But, my impression is that really the Standard Model and associated stuff used at big particle colliders is pretty well calibrated; supposedly things like the LHC throw out almost all the events they generate because they’re “typical” or “expected”. The lack of calibration you’re talking about is more an issue AFTER you’ve limited yourself to somewhere in the tails of the distribution by throwing out all the “usual” stuff.

          Is that not really the case? Are we seeing lots of “we predicted 93% of proton-proton collisions result in production of X,Y,Z… but it turns out it’s only really 74%?”

        • You are drastically oversimplifying things. Most events don’t pass the trigger not because they are “usual” stuff but because they’re only glancing collisions and in order to find new particles the collision has to be head-on. And that selection process is known only to a few percent relative uncertainty (if you’re being generous about systematic errors). There are higher-level triggers that select out for more novel events, but the selection effects there are more uncertain still: even when you get a head-on collision you have to model what goes on deep within the protons, which is hindered by substantial uncertainties, and then there is the actual measurement of the resulting explosion, which is what is actually used to select each event.

          Ultimately the largest source of uncertainty in these problems isn’t “the standard model”, it’s trying to model the ridiculously complicated consequences of the model (QCD is really, really hard) or the intricate measurement process through millions of individual channels across an array of detectors, each with their own systematic variation.

        • “Most events don’t pass the trigger not because they are “usual” stuff but because they’re only glancing collisions and in order to find new particles the collision has to be head-on.”

          So, in my admittedly naive view, the “usual stuff” is glancing collisions and we have a pretty good idea what that produces. So if p < \epsilon under a reasonably calibrated "glancing collision" model then they bother to record it.

          My basic point is, that's a legit use of a p value. It helps you eliminate a bunch of uninteresting stuff. It's a way to design "triggers".

        • I’ve tried designing similar “triggers” for seismic events. For example, if you have 3 or 4 seismometers near a fault where you expect small earthquakes, you get a LOT of boring traces. A car drives past on a road, a dumptruck unloads somewhere… a freight train thunders along… If you watch these seismometers for a week, you can take, say, 5000 two-second windows and fit some distribution to test statistics like “total wave energy at all 3 seismometers within a 2-second window”. When an observed window has a small p value under this distribution, you should probably flag it for study. If you know that the earthquake events are likely to have a specific frequency component, you might want to filter for that component before fitting the statistical model.

          Have a bunch of air quality data for the So Cal area and want to detect days when forest fires are likely to be burning? Fit some distribution to peak PM10 and PM2.5 from 200 randomly selected days where you have confirmed no forest fires were burning. Any day that is in the tail of this distribution is a day you might want to look at…

          Etc., etc. P values are reasonable ways to detect when something unusual *under a specific fitted model* happened. That’s their purpose: “probability of something more extreme than this data point under distribution H.”

          I get it that actual interesting physics at super high energies with direct head-on collisions is not necessarily easy to do with p values, but uninteresting physics at moderate energies with glancing collisions is maybe pretty easy to filter out with p values?
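
          Here is a minimal sketch of that kind of windowed trigger on completely made-up data: cut a signal into 2-second windows, fit an empirical null distribution of window energies from a quiet training stretch, and flag any window in the far upper tail for closer study. The sampling rate, the quiet hour, and the injected burst are all invented for illustration:

            import numpy as np

            rng = np.random.default_rng(0)
            fs = 100                                  # samples per second (assumed)
            quiet = rng.normal(0, 1, fs * 3600)       # a "quiet" hour standing in for real traces
            signal = rng.normal(0, 1, fs * 600)       # ten minutes we want to screen
            signal[fs * 300:fs * 302] += 5 * rng.normal(size=fs * 2)   # inject one burst to find

            def window_energy(x, win=2 * fs):
                # total energy in consecutive, non-overlapping 2-second windows
                x = x[: len(x) // win * win].reshape(-1, win)
                return (x ** 2).sum(axis=1)

            null = window_energy(quiet)               # empirical null from the quiet period
            threshold = np.quantile(null, 0.999)      # flag roughly the top 0.1%, i.e. p < 0.001

            flagged = np.where(window_energy(signal) > threshold)[0]
            print("2-second windows flagged for closer study:", flagged)   # should include window 150

          The flag is only a trigger: everything it catches still has to go through the more detailed modeling described below before anyone calls it an earthquake.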

        • @Daniel: What you are seeing FAR too regularly is detector experiments presenting results like “The largest excess seen in our search for deviations from the Standard Model is a local p-value of 3.9 sigma (global 2.0 sigma) at m = 750 GeV.” Now the experimenters themselves sort of shrug and say “hey, it might be just noise because we have seen blips like this come and go before, but we thought you ought to know.” And then theorists run off to concoct whatever Beyond-the-Standard-Model field equations might be able to fit this bump. The theory space beyond what is iron-clad is MASSIVE and poorly constrained, so you get lots of claims of *new physics* that disappear when more data comes in.

          So even with a really well-defined model, the use of local p-values encourages people to go chasing every little noise fluctuation. And I have yet to see a full Bayesian multilevel analysis of this data set like the one the OP recommends.

        • MattW: that seems reasonable. If you’re forging new territory in observational stuff, and you’ve got brand-new fancy detection equipment, etc., it’s not like your model of “what we expect” is likely to be ironclad. And if first-principles calculations using Feynman-type techniques or whatever require a gazillion iterations to get the analytical result… then you won’t have that available, so you’re not talking about a p value generated from a theoretical distribution. If you’re trying to fit the extreme tails of an empirical distribution, you’re going to need a LOT of observations. All of these work against using p values as detectors of “amazing and new!” but they still work as a trigger for “spend more effort on analyzing this.”

          I don’t ever advocate claiming p < \epsilon implies amazing discovery. But as a way to filter out common stuff from your data so you can spend time looking at the “rarer” bits it’s pretty much what you need.

          For example, in the seismic case, you’re going to throw away all the time windows where “nothing happened” and then you’re going to look at the ones where “something happened” and you’re going to have to now apply your seismic models or whatever to figure out if they make sense, locate the source, the slip mechanism, estimate a magnitude, etc. Some of those more detailed calculations are going to reveal something you still don’t care about (e.g. maybe a large long-distance event, or a power surge affecting the whole area, or someone blasting in a local quarry…). So “this is unusual” doesn’t mean “this is AMAZING” by any means.

          It sounds to me like in particle physics there’s just perverse incentives to analyze every little fluctuation, in hopes of eventually getting your name on a particle and a Nobel Prize. The upside is near-infinite so Pascal’s Wager applies :-)

        • Daniel: I see at least two other incentives here: First, quantum mechanics is WEIRD so all sorts of surprising nonsense might actually be real. (That pretty much describes quantum mechanics, in fact.) Second, nobody really “believes” the Standard Model because it’s kind of kludgey; most people think there must be something deeper we can grab onto, but with rare exceptions everything we’ve checked agrees with the SM to somewhat shocking accuracy. So we’re more or less expecting to find something surprising sooner or later. (That and it’s been so long since we’ve seen anything really surprising in a collider experiment that we get rather excited at the possibility. :)

          I don’t think there’s any harm in saying “hey, this might be something cool!” as long as it’s understood that the followup is likely “okay, maybe not”. Which seems to be the case.

          (I’m also a particle physicist. :)

    • John:

      I agree that it’s a good idea to use available, imperfect data as a guide to decide what to do next. The mistake, I think, is using the p-value to make this decision. The p-value can be super-noisy as a data summary, and it doesn’t directly address the cost and benefit questions that are relevant in decision making.

    • +1

      So long as someone does not decide this is a successful confirmatory study vindicating the use of the drug for improving Parameter-10, we are OK, I think.

      The right conclusion might be: hey, treat this as an exploratory study and let’s try to measure only the 10th parameter in the next study. And then if study after study replicates the magic of Parameter-10, well, now we are probably on to something.

      • But from an economics of research perspective (spending your time and scarce resources wisely) it is likely not a good decision.

        • On the other hand, setting those considerations aside, you cannot rule out a priori that Parameter-10 could have an effect (i.e. someone always wins a lottery).

        • Could be. But then I think it would be more productive if we could see what a better decision framework would look like.

          Not in the abstract, but in this specific example how do we use “imperfect data as a guide to decide what to do next”?

        • Rahul:

          It’s hard to give specific advice in this case because the problem itself is stated so generally. But, again, costs and benefits are relevant to decisions. To see one “p less than .05” result and nine “p greater than .05” results and then to conclude that the best strategy is to just examine one outcome going forward, that seems like way way too much decision making to extract from those very noisy p-value numbers.

        • A full decision framework would likely take the form of a value of information analysis based on a decision analytic model. In this case, some prior on the uncertainty for health gain would be required – not just the uncertainty in parameter 10. The costs and cost savings would need to be modelled, also including uncertainty.

          Given the uncertainty in costs and benefits to the health system, a value of information analysis will provide an estimate of the maximum the health system should pay for the research. This maximum value takes into account the fact that research costs health resources which could have been employed in other research or left in the health system.

          A further step is to calculate the value of sample information which takes into account the fact that a trial has a finite size and so will not eliminate but only reduce uncertainty. Larger trials will reduce more uncertainty but will also cost more and take longer to report. The optimal sample size is the balance between all these factors – naturally the optimal sample size may be zero, indicating that research resources should be spent elsewhere.

          https://www.york.ac.uk/che/pdf/tp8.pdf
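
          For anyone who wants to see the shape of the calculation, here is a minimal Monte Carlo sketch of just the expected value of perfect information (EVPI) piece. Every number in it (the QALY gain, the extra cost, the willingness-to-pay threshold, the population size) is an invented placeholder, and a real analysis would go on to the value of sample information, discounting, and properly elicited priors:

            import numpy as np

            rng = np.random.default_rng(0)
            n_draws = 100_000

            # prior uncertainty about the treatment; all numbers are invented placeholders
            qaly_gain = rng.normal(0.02, 0.03, n_draws)     # incremental QALYs per patient
            extra_cost = rng.normal(500.0, 200.0, n_draws)  # incremental cost per patient
            wtp = 20_000.0                                  # willingness to pay per QALY

            inb = wtp * qaly_gain - extra_cost              # incremental net benefit, one value per draw

            # value of deciding now vs. deciding with perfect information about the uncertain inputs
            value_now = max(inb.mean(), 0.0)
            value_perfect = np.maximum(inb, 0.0).mean()
            evpi_per_patient = value_perfect - value_now

            population = 10_000                             # patients affected by the decision (assumed)
            print(f"EVPI per patient: {evpi_per_patient:.0f}; population EVPI: {evpi_per_patient * population:.0f}")

          If the population EVPI comes out smaller than what the proposed trial would cost, that is already an answer: the research is not worth buying.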

    • Disclaimer: I’m not a statistician, nor have I stayed at a Holiday Inn recently.

      As Andrew wrote in the post, “The p-value is a random variable. It’s noisy! Small meaningless differences in the data can map to apparently huge differences in p-values (going from 0.10 to 0.01, for example).”

      Without more details into the hypothetical study, I’d really want to know more about the difference between “significant” and “non-significant” results. If, say, we had 9 indicators that had “non-significant” results, with p values between 0.10 and 0.06, and 1 that had a “significant” result of 0.04, why are the 9 indicators “boring” and the 1 indicator “interesting”? Is the distinction actually meaningful?

      Studying statistics reminds me of studying physics: math is a tool to reach an end, not an end unto itself. Getting an answer, such as a p-value, is useless unless you understand why you got the answer. As one of my physics professors once wryly noted in class: “You may solve a problem, and get an answer of 5. One thing I need you to ask yourself is, ‘Why 5? What does it mean for the answer to be 5?’ 0, ok. We love 0. 1, great. We love 1. In quantum mechanics, we re-normalize everything to 1. But why 5? Does that even make sense? If you get 5, you’re in trouble.”

      • An:

        Yes, and I’d go even further. Even if some of the p-values are .10 and others are .01, that’s still not a big deal, not at all a big difference in z-scores. It’s not just that .05 is arbitrary and that .04 is close to .06, it’s that even apparently huge differences in p-values can be easily within noise levels. (That’s the point of my 2006 paper with Hal Stern.)
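
        To put rough numbers on that point (a quick sketch, assuming two independent estimates that each have standard error 1, in the spirit of that paper):

          from scipy import stats
          import math

          # two-sided p-values of 0.10 and 0.01 correspond to these z-scores
          z1 = stats.norm.ppf(1 - 0.10 / 2)    # about 1.64
          z2 = stats.norm.ppf(1 - 0.01 / 2)    # about 2.58

          # if the two estimates are independent, each with standard error 1,
          # their difference has standard error sqrt(2)
          z_diff = (z2 - z1) / math.sqrt(2)
          p_diff = 2 * (1 - stats.norm.cdf(z_diff))
          print(round(z2 - z1, 2), round(p_diff, 2))   # gap of about 0.93 in z; p for the difference is about 0.51

        The difference between “significant” and “not significant” is itself nowhere near statistically significant.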

  4. It should have something to do with the size of the p-value. Using Bonferroni over all your tests, is measure 10 still significant?
    Are we rather talking 0.02 or 10^{-8} here?
    That it’d be useful to do some more things, like looking at costs and benefits, effect sizes, etc.: all agreed.
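
    For concreteness, the Bonferroni arithmetic for those two hypothetical p-values is just this (a quick sketch; 10 is the number of measures in the example):

      m = 10                                  # ten outcome measures were tested
      for p in (0.02, 1e-8):
          p_adj = min(1.0, m * p)             # Bonferroni-adjusted p-value
          verdict = "significant at 0.05" if p_adj < 0.05 else "not significant at 0.05"
          print(p, "->", p_adj, verdict)      # 0.02 -> 0.2 (no); 1e-8 -> 1e-7 (yes)

    So a nominal 0.02 does not survive the correction, while 10^{-8} survives easily.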

    • Ah, but you’re forgetting the garden of forking paths. What if only 1 test had been done, or 200?

      A correct Frequentist approach would be to model the frequency distribution of the number of tests in these kinds of medical studies. After getting Pr{N tests were performed}, do an N-level Bonferroni correction for each. Keep going like this until you get a “real” p-value rather than a “nominal” one.

      For some reason this reminds me of:

      https://www.youtube.com/watch?v=sdQHRQ-8xLA

    • If someone would like to stay in Fisherland, one could use Fisher’s combined probability test as a meta-check on the whole experiment. It doesn’t really solve the core of any of the problems with Yes/No detection questions. But as many individuals aren’t going to leave that school of statistics, they ought to bring every available tool to bear on that data.

  5. I feel like there is a handy statistical lexicon opening here – a decision rule on whether or not to run an n=40 experiment:

    Worth-it Experiment: If your n=40 experiment is statistically significant, it would be worth spending $1M to do a well-measured and fully-powered followup experiment.

    This is a first pass, but there is something in your blogging recently that made me think you have some sort of heuristic like this in mind. Something like: would this experiment be worth it to you to do it right, supposing your exploratory preliminary experiments pointed to a likely effect?

  6. If you know from the beginning that there are multiple plausible measures of effect that could complicate the analysis, perhaps the analysis associated with any pre-registration should take account of this. Depending on the amount of flexibility you want to allow yourself in later analysis you could have a number of pre-prepared analyses with a Holm-Bonferroni correction, a single automated analysis subject to cross-validation or bootstrapping, or complete freedom validated with a holdout sample.
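
    As a quick sketch of the Holm-Bonferroni option mentioned above (the ten p-values at the end are made up for illustration):

      def holm_bonferroni(pvals, alpha=0.05):
          """Reject/keep decision for each hypothesis using Holm's step-down procedure."""
          m = len(pvals)
          order = sorted(range(m), key=lambda i: pvals[i])   # indices from smallest to largest p
          reject = [False] * m
          for k, i in enumerate(order):
              if pvals[i] <= alpha / (m - k):                # compare (k+1)-th smallest p to alpha/(m-k)
                  reject[i] = True
              else:
                  break                                      # first failure: keep this and every larger p
          return reject

      # one small p-value and nine unremarkable ones, as in the pain-medication example
      print(holm_bonferroni([0.001, 0.02, 0.03, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]))

    With these numbers only the 0.001 survives; the 0.02 already fails its step (0.05/9) and stops the procedure.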

    • Andrew M.:

      I’d prefer to fit a hierarchical model and estimate all the effects of interest at once, and recognize my posterior uncertainty, rather than trying to twist my analysis into something statistically significant.
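
      For readers who want to see the mechanics, here is a crude empirical-Bayes sketch of that kind of partial pooling across the 10 outcomes. It is only a stand-in for a real hierarchical model fit (in Stan or whatever), and the estimates and standard errors are made up:

        import numpy as np

        # made-up per-outcome estimates and standard errors: one looks "significant", nine do not
        y = np.array([0.05, -0.10, 0.12, 0.00, 0.08, -0.04, 0.02, 0.06, -0.01, 0.45])
        s = np.full(10, 0.12)

        mu = np.average(y, weights=1 / s**2)                # precision-weighted pooled mean
        tau2 = max(np.var(y, ddof=1) - np.mean(s**2), 0.0)  # crude between-outcome variance estimate
        shrink = tau2 / (tau2 + s**2)                       # 0 = pool completely, 1 = no pooling

        theta = mu + shrink * (y - mu)                      # partially pooled effect estimates
        print(np.round(theta, 2))                           # the 0.45 gets pulled back to roughly 0.2

      The apparently dramatic tenth effect gets shrunk toward the other nine, which is exactly the honest summary of what one significant result out of ten noisy measures is worth.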

    • Definitely, and in real life a clinical trial like this would almost certainly have a pre-registered plan to take into account the multiple endpoints.

      In case anyone is interested, here is the FDA’s guidance for industry on dealing with Patient Reported Outcomes like these: http://www.fda.gov/downloads/Drugs/…/Guidances/UCM193282.pdf – the section on stats starts on page 27; it covers multiple endpoints, composite endpoints, and touches on how non-prespecified analyses will be considered. If you are really interested in the principles that the industry is meant to follow, you can search for E9 Statistical Principles for Clinical Trials and the updated Helsinki Declaration.

      There should be a new specific guidance document for the analysis of multiple endpoints in clinical trials later this year, which will hopefully give more clarity on what is acceptable for submissions to that agency.
