Uri Simonsohn warns us not to be falsely reassured

I agree with Uri Simonsohn that you don’t learn much by looking at the distribution of all the p-values that have appeared in some literature. Uri explains:

Most p-values reported in most papers are irrelevant for the strategic behavior of interest.

Covariates, manipulation checks, main effects in studies testing interactions, etc. Including them we underestimate p-hacking and we overestimate the evidential value of data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?”

He demonstrates with an example and summarizes:

Looking at all p-values is falsely reassuring.

I agree and will just add two comments:

1. I prefer the phrase “garden of forking paths” because I think the term “p-hacking” suggests intentionality or even cheating. Indeed, in the quoted passage above, Simonsohn refers to “strategic behavior.” I have no doubt that some strategic behavior and even outright cheating goes on, but I like to emphasize that the garden of forking paths can occur even when a researcher does only one analysis of the data at hand and does not directly “fish” for statistical significance.

The idea is that analyses are contingent on data, and researchers can and do make choices in data coding, data exclusion, and data analysis in light of the data they see, setting various degrees of freedom in reasonable-seeming ways that support their model of the world, thus being able to obtain statistical significance at a high rate, merely by capitalizing on chance patterns in data. It’s the forking paths, but it doesn’t feel like “hacking,” nor is it necessarily “strategic behavior” in the usual sense of the term.
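The inflation of false positives from forking paths is easy to simulate. Below is a minimal, hypothetical sketch (the particular analysis choices are invented for illustration): a “study” with no true effect, in which the analyst sees four reasonable-looking paths (two outcome measures, each with and without an outlier-exclusion rule) and, with the data in hand, reports whichever p-value comes out smallest.

```python
import numpy as np
from math import erf, sqrt

def approx_pvalue(x, y):
    """Two-sided p-value for a difference in means
    (normal approximation, adequate for n around 50)."""
    se = sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = abs(x.mean() - y.mean()) / se
    return 1 - erf(z / sqrt(2))

def forked_study(rng, n=50):
    """One null 'study': treatment and control differ only by noise.
    The analyst has two plausible outcome measures (noisy versions of
    the same latent variable) and may or may not exclude outliers,
    and keeps the smallest of the four resulting p-values."""
    latent_t = rng.normal(size=n)
    latent_c = rng.normal(size=n)
    pvals = []
    for _ in range(2):  # two plausible outcome measures
        t = latent_t + rng.normal(size=n)
        c = latent_c + rng.normal(size=n)
        for cutoff in (np.inf, 2.0):  # without / with outlier exclusion
            pvals.append(approx_pvalue(t[np.abs(t) < cutoff],
                                       c[np.abs(c) < cutoff]))
    return min(pvals)

rng = np.random.default_rng(0)
rate = np.mean([forked_study(rng) < 0.05 for _ in range(5000)])
print(f"false-positive rate: {rate:.3f}")  # well above the nominal 0.05
```

The point of the sketch is that each individual path is a defensible, ordinary analysis; the excess significance comes only from the data-contingent choice among them.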

2. If p-values are what we have, it makes sense to learn what we can from them, as in the justly influential work of Uri Simonsohn, Greg Francis, and others. But, looking at the big picture, once we move to the goal of learning about underlying effects, I think we want to be analyzing raw data (and in the context of prior information), not merely pushing these p’s around. P-values are crude data summaries, and a lot of information can be lost by moving from raw data to p-values. Doing science using published p-values is like trying to paint a picture using salad tongs.

12 Comments

  1. Economist says:

    My perspective may be (very) slightly different from yours. I think people who come at any empirical analysis from a data-first perspective don’t understand the relationships they are trying to estimate. So discussions revolve around whether an effect/relationship is true or false. In social science, at least, the discussions should involve (i) when is something true? (ii) how large is the effect? and (iii) what is the cost of being wrong (which cannot be separated from the “why do you want to know” question)?

    From this perspective, all paths in the “garden of forking paths” are useful. Together, all paths tell a story. Say you find that some relationship was significant with one model specification and not in another. Then the main work is trying to understand why, and seeing whether this story holds in other samples and can be replicated, to figure out whether it was just sampling variability.

    I see the over-emphasis on significance as a consequence of analyses done by groups at two extremes:
    i) people who know their methodology but have a poor understanding of the domain. Frankly, how many applied stats papers, or application sections in theory papers, start with some model that the statisticians just pulled out of a hat and then expend a lot of energy trying to estimate its parameters (means and/or distributions)? The resulting analysis will only be as good as the weakest link – in this case the weakest link is the atheoretical model specification – which is often just shite.

    ii) people who have no understanding of statistics.

    There may be some overlap (e.g. your ovulatory cycle and himmicane favorites).

    Unfortunately both groups may be growing.

    • Andrew says:

      Economist:

      Yes, I agree that all paths in the garden are potentially useful, and I think the right approach is for researchers to analyze all paths rather than picking a single data analysis out of the many choices available.

      • jrc says:

        Yeah, but the trick is in interpreting the differences in coefficients across specifications. Many theories have competing predictions about effects across/within different groups – for example, consider an intervention that improves child health for one child in a household. The across household comparison might tell us the effect of the intervention on biological outcomes, while the within-household comparison might tell us about complements/substitutes to child health investment in household decision making. Another example is the deworming paper… whatever your thoughts on the replication, the identification strategy was quite clever.

        In fact, I think this is an under-utilized way to do non-experimental causal inference. It is like a secret-weapon approach on a single dataset. Uri is working on this too – Specification Curve: Descriptive and Inferential Statistics for all Plausible Specifications – but I think that work could be pushed further by addressing the fact that different specifications are actually providing us with different kinds of information.
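The specification-curve idea mentioned here can be sketched in a few lines: enumerate every combination of plausible analysis choices, fit each one to the same data, and look at the whole set of estimates rather than a single chosen one. A hypothetical toy version in Python (the data and the choices are invented for illustration, not Simonsohn’s actual procedure):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)                        # hypothetical "treatment"
cov = rng.normal(size=n)                      # an optional covariate
y = 0.3 * x + 0.5 * cov + rng.normal(size=n)  # true effect of x is 0.3

def fit(control, exclude):
    """OLS slope of y on x under one specification: with/without the
    covariate, with/without excluding extreme outcome values."""
    keep = np.abs(y) < (2.5 if exclude else np.inf)
    cols = [np.ones(keep.sum()), x[keep]] + ([cov[keep]] if control else [])
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return beta[1]  # coefficient on x

# the "curve": the estimates across all plausible specifications
curve = sorted(fit(c, e) for c, e in itertools.product([False, True], repeat=2))
for b in curve:
    print(f"slope on x: {b:+.3f}")
```

In a real specification curve the estimates are plotted sorted by size, annotated with which choices produced each one; the point made above is that the differences across specifications are themselves informative, not just noise to be averaged away.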

      • Economist says:

        I agree, mostly. But where there may be a difference is that I don’t believe that the optimal model can be determined by the data alone (say, by minimizing some objective function). I think that statistical methodology is an aid to understanding social science, not the final arbiter. Ultimately the data and statistics, common sense, and knowledge of the subject area are all important.

  2. Keith O'Rourke says:

    > I think we want to be analyzing raw data
    Yep, and here is some evidence on that (full clinical study reports provide the raw data):

    Key paper:
    Tom Jefferson, et al (of The Cochrane Collaboration). Risk of bias in industry-funded oseltamivir trials: comparison of core reports versus full clinical study reports http://bmjopen.bmj.com/content/4/9/e005253.full

    The conclusion:
    “This approach is not possible when assessing trials reported in journal publications, in which articles necessarily reflect post hoc reporting with a far more sparse level of detail. We suggest that when bias is so limiting as to make meta-analysis results unreliable, either it should not be carried out or a prominent explanation of its clear limitations should be included alongside the meta-analysis.”

  3. Tom Passin says:

    I once was lead on a project to measure currents on a scale model of a large aircraft during a simulated EMP (ElectroMagnetic Pulse) event. We developed an experimental apparatus and sensors, and got good measurements. I also developed a simplified, physically based calculation that agreed closely with the data. My colleague developed a multi-mode, more abstract model that also came up with results similar to the data.

    However, I learned that his model had 9 adjustable parameters. I asked him to explore the parameter space and told him that if the results varied a lot, he would have to justify his original choice of parameter values.

    Well, exploring the parameter space quickly showed that the parameter values mattered a lot, but he was never able to justify his original choices. Apparently he had just gotten lucky. So in the end, we couldn’t use his analysis.

  4. Mayo says:

    It may be falsely reassuring to ask “Do researchers p-hack everything?”, because presumably researchers aren’t that bad, but researchers may well always “fork.”

  5. Neil Malhotra says:

    Alan Gerber and I addressed this issue by painstakingly culling only the p-values associated with researchers’ key hypotheses of interest:

    http://www.nowpublishers.com/article/Details/QJPS-8024
