Do Statistical Methods Have an Expiration Date? (my talk noon Mon 16 Apr at the University of Pennsylvania)

Do Statistical Methods Have an Expiration Date?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

There is a statistical crisis in the human sciences: many celebrated findings have failed to replicate, and careful analysis has revealed that many celebrated research projects were dead on arrival in the sense of never having sufficiently accurate data to answer the questions they were attempting to resolve. The statistical methods which revolutionized science in the 1930s-1950s no longer seem to work in the 21st century. How can this be? It turns out that when effects are small and highly variable, the classical approach of black-box inference from randomized experiments or observational studies no longer works as advertised. We discuss the conceptual barriers that have allowed researchers to avoid confronting these issues, which arise in psychology, policy research, public health, and other fields. To do better, we recommend three steps: (a) designing studies based on a perspective of realism rather than gambling or hope, (b) higher quality data collection, and (c) data analysis that combines multiple sources of information.

Some of the material in the talk appears in our recent papers, The failure of null hypothesis significance testing when studying incremental changes, and what to do about it and Some natural solutions to the p-value communication problem—and why they won’t work.

The talk is at 340 Huntsman Hall.

12 Comments

  1. Ian Fellows says:

    Along this vein, it would be interesting to think about how the need to do things by hand made things different. If you are going to take the time to get out your abacus (or whatever was used back in the day) and sum squares for each observation, you might as well collect a good amount of data and have well defined hypotheses.

    • Jeff says:

      True. Publication took more effort, too: VVE MAY THVS REIECT THE NVLL HYPOTHESIS.

    • zbicyclist says:

      Yes.

      Some years ago, I came across a bootstrap study that had been done by a competitor to allow them to estimate confidence intervals in a complex design. This was done pretty much by mechanical calculators (powered by electricity, but physically moving and making a lot of noise) and a lot of hand work.

      And I used to have a copy of Benjamin Fruchter’s book on factor analysis from 1954, in which you rotated the axes using graph paper.

      Statistically, we used to fish with a rod and reel, and now we can troll with large nets, killing a few dolphins along the way.

  2. Anoneuoid says:

    “The statistical methods which revolutionized science in the 1930s-1950s no longer seem to work in the 21st century. How can this be?”

    Assuming you mean testing the default null hypothesis, I don’t think it ever worked. Instead, I suspect the “old school” scientists in each field held back the tidal wave of BS being generated by those methods. Once they all died or retired, and the people who knew them (e.g., were trained by them) died or retired too, everyone in the next generations was trained to think in unrestrained NHST terms, and we get the current situation.

  3. a reader says:

    “(c) data analysis that combines multiple sources of information”

    Getting this to the point where we have methods that are easily and reliably usable by the masses would, in my opinion, revolutionize how statistics is used.

    The reason I say this is that in many of the bio projects I worked on, the researchers would collect 10 different independent sources of data about a given treatment and seemed to think *all* the results needed to be statistically significant for publication. This is madness, of course: if you powered each of those experiments at 0.8, you’d only have a probability of about 0.10 of getting all of them significant…and that’s assuming you don’t do something like reflexively try to control the familywise Type I error rate.
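    The arithmetic behind that 0.10 figure is easy to check. Assuming ten independent tests, each with power 0.8 and every null actually false, the probability that all ten come out significant is just 0.8 raised to the tenth power:

    ```python
    # Probability that all 10 independent experiments, each powered at 0.8,
    # reach statistical significance (assuming every null is false and the
    # tests are independent):
    p_all_significant = 0.8 ** 10
    print(round(p_all_significant, 3))  # prints 0.107
    ```

    So even a lab running well-powered experiments should expect at least one "failed" result most of the time, which is exactly why demanding that everything be significant is madness.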

    This creates an absolutely backwards incentive to look at *fewer* sources of data in order to achieve publication…or a huge incentive not to be honest about the data sources you looked at. I pushed for reporting p-values greater than 0.05 with discussion (most bio reviewers will back down from their complaints when you rebut with “our biostatistician said…”), but even this is suboptimal for a variety of reasons.

    If we had easy to use tools that motivated researchers to be more honest about all the data examined, I think that would help push scientific practices in the right direction.

    • a reader says:

      Also, it’s my opinion that the “easy to use” is both the most important and most difficult part of the problem.

      People who frequent this blog tend to favor the Bayesian side of things, yet when they read a paper with p-values and confidence intervals, I *think* they get a reasonable idea of what’s going on with the experiment (or at least they could, if they had a more complete set of p-values and confidence intervals). The issue is researchers who are *not* so comfortable with statistical methods and quantifying uncertainty misinterpreting things. In my opinion, that is the target audience for whom we should be developing methods if we want to fix how statistics is used in other fields.

    • Keith O'Rourke says:

      A related discussion came up here https://andrewgelman.com/2018/04/11/failure-failure-replicate/#comment-704694 where I responded from a philosophical perspective.

      But I could add here that the meta-analysis section I added to the Duke University intro stats course when I taught it in 2007/8 did not seem to be that difficult for the students compared to the other material.

      I think it just needs to happen in the first course on statistics rather than, say, after graduate school in a half-day course at a JSM meeting…

      • Allan C says:

        Keith:

        I am a believer that the philosophical underpinnings of scientific methods should be taught and discussed at a reasonable level before moving on to any treatment of actual methods, or at least be taught in parallel. Of course, there is some irony here in that I have little formal evidence to indicate that this would be any better (better in the scientific sense, not the career sense) than the status quo.

        Andrew:

        Tools, whether statistical, managerial, or used in building (construction, engineering, etc.), never really have an expiration date. It’s just that some tools are better than others given some amount of information and constraints. In that sense, I think the title is a little misleading, because most tools (old and crude, or new and advanced) can be useful under certain boundary conditions; this is true of every tool I can think of at the moment. To be clear, I am talking about tools/technologies that are adopted and applied by a non-trivial number of people (or would be, absent some constraint, e.g. economic); not tools developed by a crackpot which have no discernible use or advantage over existing technologies under any constraints/information.

        In short, talk seems very cool but I dislike the title, even if it’s only misleading and not wrong (a question can’t be wrong after all!).

        • Keith O'Rourke says:

          Allan C:

          Important point, which we did discuss: “Astronomers and others would often reflect on how to determine which data set [study] was the best (thus implicitly assigning weights of 0 to all the remaining data), anticipating that was the obvious solution, but they had yet to learn that, as Stigler (2016) put it, ‘the details of individual observations [studies] had to be, in effect, erased to reveal a better indication than any single observation [study] could on its own.’”

          Honestly, I think it’s just a matter of scarce time to learn, coupled with it being unclear what’s most important to learn and when we are ready to learn it (can get it).
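          Stigler’s point about “erasing” the details of individual studies can be sketched as a fixed-effect meta-analysis: each study’s estimate is weighted by its inverse variance, and only the pooled summary survives. A minimal sketch (the study estimates and variances below are invented for illustration):

          ```python
          # Fixed-effect (inverse-variance-weighted) pooling of study estimates.
          # The individual studies' "details" are erased: only the weighted
          # mean and its variance remain.

          def pooled_estimate(estimates, variances):
              """Return the inverse-variance-weighted mean and its variance."""
              weights = [1.0 / v for v in variances]
              total = sum(weights)
              mean = sum(w * e for w, e in zip(weights, estimates)) / total
              return mean, 1.0 / total

          # Hypothetical effect estimates and sampling variances from 3 studies:
          est, var = pooled_estimate([0.30, 0.10, 0.25], [0.04, 0.01, 0.09])
          print(round(est, 3), round(var, 4))  # prints 0.149 0.0073
          ```

          Note how the precise study (variance 0.01) dominates the pooled answer, yet the pooled variance is smaller than any single study’s — the gain that combining sources of information buys you.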

  4. bill raynor says:

    I’d suggest context matters here. Agricultural and industrial methods probably still work well if you are using well-measured, “hard” responses (bushels of wheat, lots of widgets, load at failure) and not so well when the “measurements” are subjective and vaguely defined, which pretty well defines psychology and much of biology.

    If one stops pretending that everything is homogeneous, continuous, and normal, drops any assumption beyond (partial) order, and focuses on choice among real alternatives, things seem to work well too. It helps if the experimenter has some skin in the game (like, say, a market loss leading to losing your job).

    • Andrew says:

      Bill:

      Yes, note the phrase “human sciences” in the first sentence of the above abstract!

      • bill raynor says:

        Andrew,

        I think we agree. I intended to emphasize the measurement and design problems. “Old” techniques work pretty well in areas where you can train (and retrain) the judges/respondents on the “scales” with stable physical standards and all that (some sensory scales).
