My talk in Amsterdam tomorrow (Wed 29 Oct): Can we use Bayesian methods to resolve the current crisis of statistically-significant research findings that don’t hold up?

The talk is at the University of Amsterdam in the Diamantbeurs (Weesperplein 4, Amsterdam), room 5.01, at noon.

Here’s the plan:

Can we use Bayesian methods to resolve the current crisis of statistically-significant research findings that don’t hold up?

In recent years, psychology and medicine have been rocked by scandals of research fraud. At the same time, there is a growing awareness of serious flaws in the general practices of statistics for scientific research, to the extent that top journals routinely publish claims that are implausible and cannot be replicated. All this is occurring despite (or perhaps because of?) statistical tools such as Type 1 error control that are supposed to restrict the rate of unreliable claims. We consider ways in which prior information and Bayesian methods might help resolve these problems.

I don’t know how organized this talk will be. It combines a bunch of ideas that have been floating around recently. Here are a few recent articles that are relevant:

[2014] When do stories work? Evidence and illustration in the social sciences. Sociological Methods and Research. (Andrew Gelman and Thomas Basboll)

[2014] The AAA tranche of subprime science. Chance. (Andrew Gelman and Eric Loken)

[2014] The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management. (Andrew Gelman)

[2013] Is it possible to be an ethicist without being mean to people? Chance 26 (4). (Andrew Gelman)

[2013] It’s too hard to publish criticisms and obtain data for replication. Chance 26 (3), 49–52. (Andrew Gelman)

[2013] Convincing evidence. For a volume on theoretical or methodological research on authorship, functional roles, reputation, and credibility on social media, ed. Sorin Matei and Elisa Bertino. (Andrew Gelman and Keith O’Rourke)

[2013] To throw away data: Plagiarism as a statistical crime. American Scientist 101, 168–171. (Andrew Gelman and Thomas Basboll)

[2013] Difficulties in making inferences about scientific truth from distributions of published p-values. Biostatistics. (Andrew Gelman and Keith O’Rourke)

[2013] The problem with p-values is how they’re used. Ecology. (Andrew Gelman)

[2013] Interrogating P-values. Journal of Mathematical Psychology. (Andrew Gelman)

[2013] They’d rather be rigorous than right. Chance 26 (2), 45–49. (Andrew Gelman)

[2012] P-values and statistical practice. Epidemiology. (Andrew Gelman)

[2012] Ethics and the statistical use of prior information. Chance 25 (4), 52–54. (Andrew Gelman)

Design analysis, prospective or retrospective, using external information. (Andrew Gelman and John Carlin)

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. (Andrew Gelman and Eric Loken)

Also a few more that we are in the midst of completing.

I’m hoping that giving this talk will help me get these thoughts in order.

P.S. I gave the talk. It took a bit more than an hour. The slides are here, but lots of my best riffs were in the spoken version only.

19 thoughts on “My talk in Amsterdam tomorrow (Wed 29 Oct): Can we use Bayesian methods to resolve the current crisis of statistically-significant research findings that don’t hold up?”

  1. One point you make in your articles is that making simple yes/no decisions based on straw-man null hypotheses is not the way to go. How can this be translated to real change? You seem to be asking a lot from the user of frequentist methods.

    In one of your papers you also say that there are serious problems with using non-informative priors, and that one should use informative priors:

    From “P-values and statistical practice”: there are “serious problems with apparently noninformative or weak priors,” and “I think we cannot avoid informative priors if we wish to make reasonable unconditional probability statements.”

    You also say: “…in real problems, prior information is always available and is often strong enough to have an appreciable impact on inferences.”

    I often find myself fitting linear mixed models (usually, varying intercept and varying slopes models of the type discussed in Gelman and Hill 2007). These are fairly typical analyses in psychology and linguistics. My experience has been that the likelihood always overwhelms the prior, even when you start with a tight prior distribution. This is because I usually have a lot of data (I make sure I have a lot of data, just because I can get it).

    Fitting models with informative priors is not going to change much for people with my kind of data, i.e., the typical psychology experiment. (Am I wrong about this? I’ve fit a lot of models over the last year, but that’s still only a limited amount of experience with typical data-analysis problems.)
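
    For concreteness, here is the kind of comparison I have in mind, as a rough sketch with simulated data and made-up variable names (rt, cond, subj), using rstanarm simply as one convenient way to put priors on the coefficients of a varying-intercept, varying-slope model:

    library(rstanarm)  # Bayesian mixed models with lme4-style formulas

    set.seed(1)
    n_subj <- 40; n_item <- 40
    d <- expand.grid(subj = factor(1:n_subj), item = factor(1:n_item))
    d$cond <- ifelse(as.integer(d$item) %% 2 == 0, 0.5, -0.5)   # +/- 0.5 contrast coding
    subj_int <- rnorm(n_subj, 0, 30)   # by-subject intercept adjustments
    subj_slo <- rnorm(n_subj, 0, 10)   # by-subject slope adjustments
    d$rt <- 400 + subj_int[as.integer(d$subj)] +
      (20 + subj_slo[as.integer(d$subj)]) * d$cond +
      rnorm(nrow(d), 0, 50)

    # Varying-intercept, varying-slope model, fit once with a vague prior and
    # once with a fairly tight prior on the coefficients:
    fit_vague <- stan_lmer(rt ~ cond + (1 + cond | subj), data = d,
                           prior = normal(0, 100), chains = 2, iter = 1000)
    fit_tight <- stan_lmer(rt ~ cond + (1 + cond | subj), data = d,
                           prior = normal(0, 5), chains = 2, iter = 1000)
    round(rbind(vague = fixef(fit_vague), tight = fixef(fit_tight)), 1)
    # Compare how much (or how little) the change of prior moves the slope estimate.

    That is the kind of check I have been running, and in my data sets the two fits have come out nearly indistinguishable.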

    There is a real disconnect between what you write and recommend and the reality of the data-analysis problems facing psychologists and linguists.

    In addition, I would prefer to use non-informative priors as well as informative ones, and as I understand it, it’s vital to do that in order to investigate the role of the prior. If non-informative priors have serious problems, where does this leave us?

    Finally, you want to move away from yes/no decisions, but that is how people reason about their theories. They try to come up with decisive tests that rule out one theory and are consistent with the predictions of another. If you take that away, what is an experimental paper going to look like? This is orthogonal to the p-value question. Even if I fit a Bayesian model, where one theory predicts that theta>0 and the other that theta<0, I am still going to make a yes/no decision based on my posterior distribution.
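
    (To be concrete, the “decision” I end up making is essentially this, sketched here with made-up posterior draws; in a real analysis theta_draws would be extracted from the fitted model:)

    # Toy sketch: a directional yes/no summary from posterior draws of theta.
    theta_draws <- rnorm(4000, mean = 15, sd = 10)   # stand-in for real posterior draws
    p_pos <- mean(theta_draws > 0)                   # Pr(theta > 0 | data)
    p_neg <- mean(theta_draws < 0)                   # Pr(theta < 0 | data)
    round(c(p_pos = p_pos, p_neg = p_neg), 3)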

    • I think you are asking excellent questions and I’d love to hear an answer as well. I’m not from linguistics but I often face very similar challenges to the ones you describe.

      I especially like the point about yes/no decisions. I’ll go beyond experimental papers and reasoning about theories and mention that most real-world applied decisions (e.g., does this reactor work; should I build a steel pipeline or a concrete one; should we sell peach yogurt or mango; does paint prevent rust better than epoxy) are ultimately constrained to be yes/no decisions.

      Moving away from yes/no decisions may be intellectually satisfying, but as an actionable strategy it seems hard. I may be doing things wrong. I’d love to hear from others.

        • I don’t think the yes/no is the problem in scientific culture (although some people oversimplify, that’s not a statistical problem per se) so much as the straw man. Imagine you have a new measure and you compare it against the old, established, “validated” (whatever that means…) measure. Of course it’s an abuse, but it is quite common to see people running correlations and reporting the default p-values that come out of the computer. What was the default null hypothesis? r=0. Which we knew all along to be incredibly unlikely. There are countless more subtle, hard-to-spot versions of this same error. And yeah, it’s kind of hard to be more informative to your clients/readers while also clinging to hardcore frequentism… but we owe it to them to help them understand that their decisions always have some risk attached, and to try to quantify that uncertainty: they might build the reactor on the best evidence and still have it smashed by an earthquake. Or, to put it another way…

        “Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.” (Bayes, T. 1763)

        • I guess my question really is: suppose the psychologists and linguists etc. were to suddenly start listening to Andrew, what does Andrew think would be the ideal kind of experimental paper that they would be publishing?

          There seems to be a gap between Andrew’s goals when he looks at a data set, and a psychologist’s or a psycholinguist’s goals. Unless these goals are aligned, there’s no way that Andrew’s recommendations will become mainstream in these disciplines.

      • Maybe my view is very naive, but here are my 2 cents on binary decisions:
        When we do experiments to arbitrate between competing theories, we do not need to declare one the winner. Instead we could compare the (Bayesian) evidence for the competing theories and express how much more likely we believe one theory to be true than the other.
        About real-world decisions: it is true that these are often binary, but that does not mean that the statistical methods we use to support action selection must themselves have binary outcomes. An alternative, decision-theoretic approach is to combine the estimated probabilities of the outcomes of actions (e.g., the probability of increasing graduation rates through more tutoring) with the associated costs and benefits, in order to choose actions based on expected values.
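
        A toy version of that expected-value calculation (all numbers here are hypothetical; in practice the probability would come from the posterior and the costs and benefits from the decision problem itself):

        p_improves <- 0.70        # Pr(tutoring raises graduation rates | data), from the model
        benefit    <- 500000      # value assigned to higher graduation rates
        cost       <- 200000      # cost of running the tutoring program
        ev <- c(tutoring   = p_improves * benefit - cost,
                status_quo = 0)
        names(which.max(ev))      # choose the action with the higher expected value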

      • I’m not sure I understand where the yes/no problem is arising; that is, “Even if I fit a bayesian model, where one theory predicts that theta>0 and the other that theta<0, I am still going to make a yes/no decision based on my posterior distribution." and "Moving away from yes / no decisions may be intellectually satisfying but as actionable strategy seems hard." seem like non-issues to me.

        Wouldn't "write down your loss function and compute your expected loss for choosing each of yes/no" be the standard Bayesian answer? It seems to me that the usual Bayesian decision analysis is already set up to accommodate making yes/no decisions on the basis of a model of reality that is more nuanced than just designating each considered model as true/false.

        Maybe I'm misunderstanding you all or Bayesian decision analysis somehow?

        • Coincidentally, I’m learning about decision theory in my MSc course these days, so I actually understand what Phil means (this would not have been true two weeks ago)!

          But isn’t it true that I am already done if P(theta>0) is much larger than P(theta<0)? I don't need to compute any loss function if the former is 0.99 and the latter 0.01. In most studies of the type that people like me do, we set up experiments where we have a decisive test like this for theory A and against theory B.

          I can see the value of decision theory in making, say, business decisions, where the optimal action is not obvious, but in studies like these I can't see its relevance. (I would love to see how I can use decision theory usefully for such problems; is there any published example of a psychological study that uses loss functions to figure out what to conclude?)

        • Shravan:

          In some way the problem is with the focus on “theta.” Effects (and, more generally, comparisons) vary, they can be positive for some people in some settings and negative for other people in other settings. If you’re talking about a single “theta,” you have to define what population and what scenario you are thinking about. And it’s probably not the population of Mechanical Turk participants and the scenario of an online survey. If an effect is very small and positive in one population in one scenario, there’s no real reason to be confident that it will be positive in a different population in a different scenario.

    • I also think that these are excellent questions, Shravan, and I am glad you are asking them.

      You write that fitting with informative priors isn’t going to change matters much in typical experimental settings, but I wonder whether this is always the case. I frequently get into discussions with other psycholinguists about fitting so-called “maximal random effects” linear mixed models like those recommended by Dale Barr and his colleagues. From what I have seen, there are convergence issues with some of these models. I wonder whether mildly informative priors, along the lines of the approach taken in blmer, would help with this. I think that this has not been fully explored yet, or at least I am not aware of it.
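
      To sketch what I mean (hypothetical variable names, simulated data with no true random slopes, so the maximal lme4 fit will tend to be singular or throw convergence warnings; blmer’s default mildly informative Wishart prior on the random-effect covariance is one form of regularization):

      library(lme4)
      library(blme)

      set.seed(1)
      d <- expand.grid(subj = factor(1:20), item = factor(1:16), cond = c(-0.5, 0.5))
      d$rt <- 400 + 20 * d$cond + rnorm(nrow(d), 0, 50)   # no by-subject/by-item slope variation

      form <- rt ~ cond + (1 + cond | subj) + (1 + cond | item)
      m_lmer  <- lmer(form, data = d)                  # maximal model; often singular on data like these
      m_blmer <- blmer(form, data = d,
                       cov.prior = wishart)            # blmer's default covariance prior, made explicit
      VarCorr(m_lmer)
      VarCorr(m_blmer)   # the prior pulls the covariance estimates away from the boundary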

      • Dear Doug,

        I think that even with my limited experience I can go out on a limb and say that, provided you have enough data (and I always make sure I do!), your likelihood will overwhelm the prior (unless perhaps you have an implausibly “tight” prior). All those exercises on conjugate priors that show that the posterior mean is a weighted average of the prior and the likelihood also illustrate that point analytically.
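
        For what it’s worth, the one-parameter version of that arithmetic (normal data with known sigma, normal prior on the mean) shows the point directly:

        # Posterior mean as a precision-weighted average of prior mean and sample mean.
        post_mean <- function(n, ybar, sigma, mu0, tau0) {
          w_prior <- (1 / tau0^2) / (1 / tau0^2 + n / sigma^2)   # weight on the prior
          c(weight_on_prior = w_prior,
            posterior_mean  = w_prior * mu0 + (1 - w_prior) * ybar)
        }
        post_mean(n = 20,   ybar = 50, sigma = 100, mu0 = 0, tau0 = 10)   # small n: prior dominates
        post_mean(n = 2000, ybar = 50, sigma = 100, mu0 = 0, tau0 = 10)   # large n: data dominate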

        About the Barr et al. recommendation to fit maximal models (see here: http://idiom.ucsd.edu/~rlevy/papers/barr-etal-2013-jml.pdf), you will find that the lmer function just reports a convergence failure or the like but still returns a model fit, whereas JAGS will simply crash when you try to fit a maximal model blindly. My understanding is that convergence failures happen when there is not enough data to estimate all the variance components specified in the maximal model.

        What you say makes sense: use an informative prior when you don’t have enough data. But what I don’t understand is: why would I have priors for my random-effects variance components? I don’t have any sensible beliefs about them other than that they are probably greater than zero; I only have prior beliefs about my coefficients. The random effects are just nuisance variables for me. The only plausible prior I can have for my random effects is a flat prior bounded below by 0 and above by my historical upper bound for the observed variance.

        • My understanding is also that, provided you have ‘enough’ data (and the prior is not too wacky, i.e. doesn’t assign 0 density to possible outcomes), the likelihood will dominate the prior in determining the posterior.

          Is it really the standard case in psych that you have 'enough' data, though? All the talk of severely underpowered studies would seem to suggest otherwise, and my various reads of psych pubs have never left me with the impression that getting a desired N is easy, especially if the research is conducted with a group other than undergrads. I suspect you may just be especially assiduous in recruiting participants, Shravan!

        • I have to admit that you are right, Phil. My comment does not square with the fact that people run studies with low power, and people will happily argue for the null with such underpowered studies.

          People within psychology periodically point this problem out to the community (and I include psycholinguists in this community), but things don’t seem to change much. The most common explanations I have heard for running underpowered studies are that it’s hard to find enough subjects, or that it’s too expensive to run many subjects. But if you’re willing to run fewer studies you can get enough subjects, whatever enough is! I guess it’s probably very hard to accept that one will be publishing much less per year.

    • Good questions (as others have added), all of which should be addressed in a good graduate education in Bayesian (applied) statistical practice (so I am guessing Andrew can’t answer them in a short comment). You might wish to read the student notes for Andrew’s G+ hangout BDA course here: http://www.stat.columbia.edu/~gelman/book/220lecturenotes.pdf. I am hoping G+ will get past the 10-participant limit, but that doesn’t seem promising.

      Now my other guess is that, more often than not, these issues are left out, and the primary focus is instead on Bayes’ theorem: getting a posterior from a (nuisance) prior, as if the (one and only?) posterior distribution were the immediate answer to everything. The math is hard and MCMC sampling very interesting, so there is lots to learn.

      As an example of needing informative priors, take the bit at the very end of http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics9.pdf, that spiral of confusion spun by underpowered studies. How does one get out of that spiral with some informative prior information about the unknown size of the effect?

  2. Hello Andrew,

    I think the PDF of your talk is corrupt (I have re-downloaded it multiple times and I can’t open it, no matter which PDF viewer I use). Could you please re-upload it?

    Thanks,
    Robert
