I’d like to see a preregistered replication on this one

Under the heading, “Results too good to be true,” Lee Sechrest points me to this discussion by “Neuroskeptic” of a discussion by psychology researcher Greg Francis of a published (and publicized) claim by biologists Brian Dias and Kerry Ressler that “Parental olfactory experience [in mice] influences behavior and neural structure in subsequent generations.” That’s a pretty big and surprising claim, and Dias and Ressler support it with some data: p=0.043, p=0.003, p=0.020, p=0.005, etc.

Francis’s key ground for suspicion is that Dias and Ressler, in their paper, present 10 successful (statistically significant) results in a row, and, given the effect sizes they estimated, it would be unlikely to see such an unbroken string of successes.
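
To give a sense of the arithmetic behind that argument (a back-of-the-envelope illustration with a made-up power value, not Francis’s actual calculation): even if every one of the 10 experiments had 80% power, the chance of all 10 coming up statistically significant would be only about 10%. In R:

n_experiments <- 10
power_each    <- 0.8         # assumed power per experiment; made up for illustration
power_each^n_experiments     # probability that all 10 are significant: about 0.107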

Dias and Ressler replied that they did actually report negative results:

While we wish that all our behavioral, neuroanatomical, and epigenetic data were successful and statistically significant, one only need look at the Supporting Information in the article to see that data generated for all four figures in the Supporting Information did not yield significant results. We do not believe that nonsignificant data support our theoretical claims as suggested.

Francis followed up:

The non-significant effects reported by Dias & Ressler were not characterised by them as being “unsuccessful” but were either integrated into their theoretical ideas or were deemed irrelevant (some were controls that helped them make other arguments). Of course scientists have to change theories to match data, but if the data are noisy then this practice means the theory chases noise (and the findings show excess success relative to the theory).

I would also like to say that it’s probably not a good idea for Dias and Ressler to wish that all their data are “successful and statistically significant.” With small samples and small effects, this just isn’t gonna happen—indeed, it shouldn’t happen. Variation implies that not every small experiment will be statistically significant (or even in the desired direction), and I think it’s a mistake to define “success” in this way.

Do a large preregistered replication

In any case, the solution here seems pretty clear to me. Do a large preregistered replication. This is obvious but it’s not clear that it’s really being done. For example, in a news article from 2013, Virginia Hughes describes the research in question as “tantalizing,” writes that “other researchers seem convinced . . . neuroscientists, too, are enthusiastic about what these results might mean for understanding the brain,” and talks about further research (“A good next step in resolving these pesky mechanistic questions would be to use chromatography to see whether odorant molecules like acetophenone actually get into the animals’ bloodstream . . . First, though, Dias and Ressler are working on another behavioral experiment. . . . Scientists, I have to assume, will be furiously working on what that something is for many decades to come . . .”), but I see no mention of any plan for a preregistered replication.

I’d like to see a clean, pure, large, preregistered replication such as Nosek, Spies, and Motyl did in their “50 shades of gray” paper. I recognize that this costs time, effort, and money. Still, replication in a biological study of mice seems so much easier than replication in political science or economics, and it would resolve a lot of statistical issues.

38 thoughts on “I’d like to see a preregistered replication on this one”

  1. “I think it’s a mistake to define ‘success’ in this way.”

    Agreed – this is basically the same thinking that upset me about equating “false positives” with “fraud” in that Schnall article/post a few weeks ago*. It basically assumes that if your theory is right, you will *always* get a statistically significant effect in your study, or else something went “wrong”. But even a high school statistics course will teach you that, even if you believe in one true parameter value out there in the world**, confidence intervals don’t (shouldn’t!) always cover it!

    I think that if I had run 14 small experiments and 10 of them rejected an effect of 0, I’d happily proclaim total success for that phase of the research, and then, if it was a finding as big as “odors associated with triggering bad events in parents can be passed inter-generationally”, I’d do some power calculations (see the sketch at the end of this comment) and run a large enough experiment to make a really definitive case. That is a big F-ing deal of a finding.

    *http://statmodeling.stat.columbia.edu/2014/11/19/24265/

    **No, no I don’t usually believe in one true parameter value for a population. But in this case I would be willing to believe that there may be some biological/epigenetic mechanism we don’t really understand that allows for things like “reactions to smells associated with danger in the previous generation” to be passed down from parents to children.

    PS/Side Comment – I was reading a popular article today on baby brain functioning, and the first baby they strap in starts crying and squirming and they have to move on to the next baby. And my thought was “well OK, this research holds for docile babies I guess.” And that reminded me of a conversation I had with a guy who worried that the bats he caught in his nets were the bats with the least good echo-location skills – and he studies echo-location! So which mice are they treating and testing? The ones stupid enough not to hide very well when the hand comes reaching in? Sorry – selection in lab animals is a new interest of mine (and a rich vein for jokes).
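
    PPS Following up on the power-calculation remark above, here is the kind of back-of-the-envelope calculation I had in mind, in base R (the effect size and SD are made up, just to show the mechanics):

    # how many mice per group to detect a half-SD difference with 95% power?
    power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.95)
    # roughly n = 105 per group, i.e. a much bigger experiment than the small ones discussed above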

  2. I’m the one who commented on this paper on this blog before.

    Anyway, I think the science writer Virginia Hughes was a bit carried away when she wrote “other researchers seem convinced”. There were researchers who expressed skepticism in the same article. I have seen other venues where skepticism toward this work was expressed, too. Also note that she came across this work at a meeting before the publication. So, it’s possible that not all the researchers she interviewed had had a chance to examine the data carefully.

    I’m not optimistic that there will be a preregistered replication. This is not my field, but I don’t get the impression that preregistered replications are commonly done. And neither Dias/Ressler nor other researchers have an incentive to do such a replication.

    • Art:

      Your point, about the researchers not having an incentive to do a preregistered replication, is interesting.

      As you say, there is widespread skepticism about this work so one would think that, if the researchers really believed in their phenomenon, they’d have every incentive to replicate, to silence the doubters once and for all.

      On the other hand, if they don’t fully believe it themselves, they have an incentive to quit while they’re ahead and not do too careful a check.

      I feel like I see this a lot: researchers who clearly believe the underlying theory they espouse, and who insist that their data are conclusive, but who would not want to do a pure preregistered replication, I suspect because they have some doubts that their pattern will really show up again.

      • Andrea:

        I don’t know whether the researchers believe in their theory while unconsciously harboring doubts, or whether they make a conscious decision to game the system, but I often see this kind of attitude. One example is the infamous arsenic DNA study. The lead author seemed to genuinely believe the astonishing conclusion that the bacteria they grew incorporated arsenic into their DNA. But they didn’t try very hard to exclude alternative possibilities. They used instruments and methods that are very specialized and sound impressive, but they didn’t do the more basic checks. It was as if they were afraid to prove themselves wrong. But, as Feynman said, the first principle is that you must not fool yourself.

        It can hurt a researcher’s career to be proven wrong in a high-profile case or to have to retract a paper. But otherwise such work can help his or her career, because it counts as a publication. I think that is also the case for the ovulation-and-clothing-color study and other studies that Andrew criticizes.

        • Yes, “afraid to prove themselves wrong.” Well put. I think what’s happening is that these researchers have an uncertain grasp of statistics (which is only fair; after all, I only have an uncertain grasp of biology, psychology, etc). At one level, they tend to think deterministically, and they think their statistically significant result represents a sort of proof. But at another level, they understand that real data are messy, and they have the inclination, once they have a big pile of chips in front of them, to step away from the table and cash in their chips.

          In the case of the ovulation-and-clothing study, the researchers did replicate, but they didn’t do a preregistered replication. Doing it in a non-preregistered way gave them enough wiggle room that, although their initial finding indeed did not replicate, they were able to declare victory by finding a statistically significant comparison elsewhere in their data.

          But, as I’ve discussed in various places, I wouldn’t have recommended they do any replication, preregistered or otherwise, using that same design, as it was a classic “power = .06” design (http://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/) that was essentially dead on arrival. I recommended to them several times that, if they want to learn about this phenomenon at all, they need to have much better measurement, they need a within-person design, and they need to read up a bit on when are the most fecund days of the month. As it was, they were just chasing noise. My recommendations just got them annoyed—and that’s fine, they have every right to be annoyed at some outsider who doesn’t believe their claims—but, still, as long as they stick with those noisy measurements, they don’t have a chance.

  3. Related to pre-registration: I’m interested in doing a pre-registered study. How would this work exactly? Do I get to publish it anyway as long as I pre-register it? Are there any psych journals that have a procedure for pre-registration? It’s about time they did.

    Often I don’t want to announce the study publicly, not until I can publish it (conference etc.). So just posting it to some secure portal does not work.

  4. As I said in that previous thread, the problem is not stats and will not be resolved by pre-registration or direct replication. They had poor experimental design (the counter-balancing). Run this script, which has no “memory effect,” to simulate their figure 1. Simply increase startle response in the presence of *any* scent and decrease startle response from day 1 to day 2 (i.e., habituation) and you get the same type of results. I leave figuring out the best set of parameters to use for the simulation to others.

    R simulation script:
    http://pastebin.com/xvNMQ1YU

    Representative result:
    http://postimg.org/image/eszz6v723/

    • Confounding would be a good explanation for 10/14 or even 10/10 statistically significant studies.

      There is always a serious concern when studies are not randomised or if the randomisation gets broken – consistent replication might not be signaling successful replication but rather a common, important systematic error. Pre-registration or direct replication can be helpful in picking these things up when they are subtle (not obvious from evaluating the design).

      p.s. I could not easily understand your code – though I am sure it is obvious to you.

      • Note: I cannot tell if these posts get through or not. Sometimes it says “waiting for moderation”, sometimes not. Probably some issue with the spam filter.

        Keith, their design and terminology are kind of confusing. The simulation attempts to capture what they have described here (the %OPS arithmetic is spelled out as a small function at the end of this comment):

        “OPS of adult offspring. Mice were habituated to the startle chambers for 5–10 min on three separate days. On the day of testing, mice were first exposed to 15 startle alone (105-dB noise burst) trials (leaders), before being presented with ten odor + startle trials randomly intermingled with ten startle-alone trials. The odor + startle trials consisted of a 10-s odor presentation co-terminating with a 50-ms, 105-dB noise burst. For each mouse, an OPS score was computed by subtracting the startle response in the first odor + startle trial from the startle response in the last startle-alone leader. This OPS score was then divided by the last startle-alone leader and multiplied by 100 to yield the percent OPS score (% OPS) reported in the results. Mice were exposed to the acetophenone-potentiated startle (acetophenone + startle) and propanol-potentiated startle (propanol + startle) procedures on independent days in a counter-balanced fashion.”

        I added some comments and formatted a bit to help clarify
        http://pastebin.com/JaMFT0i2
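
        To make the %OPS arithmetic concrete, here it is as a small R function (my own restatement of the quoted description, with invented example numbers; the sign convention simply follows the quoted wording):

        percent_ops <- function(last_leader, first_odor_startle) {
          # "subtracting the startle response in the first odor + startle trial
          # from the startle response in the last startle-alone leader"
          ops <- last_leader - first_odor_startle
          # "divided by the last startle-alone leader and multiplied by 100"
          100 * ops / last_leader
        }
        percent_ops(last_leader = 200, first_odor_startle = 260)   # returns -30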

  5. Hey, I’ve had people come to my lab to do a Master’s degree, take project data to which lab members made fundamental contributions (lit review, designing the study including the items, software development for running the study, data analysis), and publish it solely under their own name. It’s rare (2 cases in my 10 years at Potsdam), but very expensive when it happens. If we have a good idea and are running a study, we definitely don’t want to tell the whole world until we’ve tested it empirically. OTOH, how to guarantee honesty in our conclusions? A lot of work in psycholinguistics is openly post-hoc.

    That said, I am considering posting open problems on my home page for prospective students, to encourage people to come work with me. Initially I was afraid to, but now that I have such a long list that there’s no way I will get to everything in my lifetime, it’s starting to make sense to post these publicly. But this is still different from an experiment I would do myself.

    Is there anyone out there who posts planned work online for the world to see?

    • If they do that, isn’t there anything you can do about it? Complain to the Journal? Their institutions? Anything?

      Isn’t there any formal recourse for these sort of complaints?

      • This sounds like a good reason for pre-registration, at least in some cases. I think the Screen Writers’ Guild (if I have the name right) has or had some kind of preregistration that I read about many years ago. You did a quick-and-dirty outline of your plot and deposited it with the SWG, which protected you from having your idea pirated if you made a pitch to a studio and it was later found that they had “accidentally” come up with the same idea.

        Situations in the research world are different, but in some cases it might be a good way to protect oneself or one’s students and post-docs.

  6. @Andrew:

    Are there any papers you can think of that did not use p-values/NHST and yet that you critiqued, saying you sorely wanted to see a clean, pure, large, preregistered replication?

    Alternatively have you invoked the fishing / garden-of-forking-paths critique against any non-NHST papers?

    • Rahul:

      The “garden of forking paths” is a relevant criticism of null hypothesis significance testing because it directly addresses the claim of a significance test that the observed result would be a surprise if the null hypothesis were true.

      That is, a significance test is explicitly a claim about the garden of forking paths, it’s a claim that the researcher would’ve done the same analysis had the data been different.

      Shravan, Rahul:

      Analyses that don’t use significance testing are, of course, subject to many other problems. In particular, Bayesian analyses can have problems with (apparently) noninformative priors. In our paper on multiple comparisons, Jennifer, Masanao, and I talk about how many problems of multiple comparisons go away when we use hierarchical models (which is mathematically similar to connections between false discovery rates and hierarchical modeling noted by Efron); the flip side of this is that if you use flat priors, your Bayesian inferences can be bad, in the sense of giving posterior inferences that are wrong (point estimates that are uncalibrated, interval estimates with poor coverage) in predictable ways.
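
      To illustrate that last point with a toy example (mine, not one from the paper; I treat the group-level and data-level variances as known for simplicity): when the true group effects are small and the measurements are noisy, the raw flat-prior estimates are far too spread out, while the hierarchical, partially pooled estimates are not.

      set.seed(1)
      J     <- 50
      tau   <- 0.5                      # sd of the true group effects (small)
      sigma <- 2                        # sd of each group's raw estimate (noisy)
      theta <- rnorm(J, 0, tau)         # true effects
      y     <- rnorm(J, theta, sigma)   # raw estimates (what a flat prior hands back)

      shrink    <- tau^2 / (tau^2 + sigma^2)   # normal-normal posterior-mean shrinkage
      theta_hat <- shrink * y

      c(rmse_flat = sqrt(mean((y - theta)^2)),
        rmse_hier = sqrt(mean((theta_hat - theta)^2)))
      # the unpooled estimates miss by about sigma; the shrunken ones by far less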

      • Andrew, I don’t understand this last point:

        ” the flip side of this is that if you use flat priors, your Bayesian inferences can be bad, in the sense of giving posterior inferences that are wrong (point estimates that are uncalibrated, interval estimates with poor coverage) in predictable ways.”

        I generally work with large data sets, and generally fit linear mixed models. I often fit LMMs in Stan (almost on a daily basis now) and use a full variance-covariance matrix for the varying intercepts/slopes. I can usually (I’d say almost always) get estimates comparable to those from the standard tool in psycholinguistics (lmer). I’m pretty sure the Stan estimates using flat priors are not wrong (I can also recover estimates that match the parameters that generated simulated data).

        Are you talking here only about small data-sets? One shouldn’t analyze them anyway (unless one has no choice, e.g., in medicine or when studying special populations; in which case your caution makes sense).

        I’m trying to recall where you showed such an example (where flat priors cause serious problems) with a large data-set, and I’m coming up blank. I guess I’m not being precise as to what “large” means.

        • I’d read this paper before, but I just read it again. I don’t see anything here implying that

          “if you use flat priors, your Bayesian inferences can be bad, in the sense of giving posterior inferences that are wrong (point estimates that are uncalibrated, interval estimates with poor coverage) in predictable ways.”

          The paper shows that multilevel modeling can solve many of the multiple comparison problems that people encounter. It doesn’t show (as far as I can tell) that flat priors lead to bad Bayesian inferences.

          Maybe I missed something. I only found three discussions about priors:

          1. the paper says “we assign non-informative uniform prior distributions to the hyperparameters” (bottom of p. 200) in connection with fig. 6.

          2. Then for the eight schools example the paper says: “The Bayesian analysis here uses a uniform prior distribution on the hyperparameters—the mean and standard deviation of the school effects—and so it uses no more information than the classical analysis”

          3. Finally, in another example the paper says: “Alternatively, a Bayesian analysis with a reasonably uninformative prior distribution (with heavy tails to give higher probability to the possibility of a larger effect) reveals the lack of information in the data (Gelman & Weakliem, 2009).”

        • I think the importance of prior information varies a lot from area to area, from model structure to model structure, and from data-set to data-set. Note also that what seems “uninformative” in low dimensional settings can become “informative” when imposed on some particular model structure in high dimensions.

          For example, you model the (log) price of some asset as a bunch of daily increments over a year or two. You decide each increment should be normally distributed, and you give each normal a standard deviation that is wider than the overall distribution of increments in the previous year. You are putting relatively “uninformative” priors on each increment, because you’re saying you know less than you really do about the size of the increments, but because of your model structure, you are really only putting probability on Brownian-motion-type trajectories. You will completely rule out a massive price crash over 3 days after someone reveals that fraud is going on in the company.
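
          A quick numerical version of this (the numbers are invented): give each daily log-return a “wide” normal with sd 0.03, and a 40% drop over three days gets essentially zero probability under the model.

          daily_sd     <- 0.03                  # "wide" sd for each daily log-return increment
          three_day_sd <- sqrt(3) * daily_sd    # sd of the sum of three independent increments
          pnorm(log(0.60), mean = 0, sd = three_day_sd)   # P(log price falls by log(0.6) or more in 3 days)
          # roughly 4e-23: the "uninformative" increments have already ruled the crash out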

        • Thanks. I understand this example, but I can’t relate to it in the sense that it’s not relevant to the kinds of data I encounter. So I’m still puzzled by Andrew’s comment. I’ll try to come up with an example myself.

  7. How about preregistration or large sample sizes? Are they of no consequence once one goes the Bayesian route?

    I ask because I don’t remember you having criticised a Bayesian study either because it didn’t preregister or because it used a small sample.

    • Rahul:

      I don’t think I’ve ever criticized a study for not being preregistered. I’ve never done a preregistered study in my life. I’m just saying that, in the case described above, I’d like to see a preregistered replication. That’s not a criticism of the original study (although it can be taken as a criticism of the study’s interpretation).

        • I can imagine wanting a repeated study, especially when the size of the study is small, or the population that is being studied is unclear, but if the Bayesian model makes sense, then it’s extracting the information that’s in the data and giving it to you in a way in which uncertainty is accounted for. The fact that an effect is large enough that its credible interval excludes zero within a particular model tells you a lot more about what is going on than “we got p < 0.05 on a test.”

          Also, a Bayesian model, written down in Stan/BUGS/JAGS etc., *is* a kind of pre-registration of your analysis. Taking a second sample and running the same analysis would, in my mind, count as a preregistered replication.

          So, wanting to see a replication vs wanting to see a *pre-registered* replication in the Bayesian context has less of a distinction in my mind. (now, if you write a brand new model for the second analysis… then I'm going to be suspicious, because a Bayesian model is a much more application specific thing than a t-test of differences or whatever).

  8. Pingback: A week of links | EVOLVING ECONOMICS

  9. Whether the results (i.e. p-values) are too good to be true is a moot point since Dias and Ressler use a split-plot design but do not take this into account in the analysis, and hence use the wrong error term. When can meaningless p-values be too good to be true!? It also appears that some samples have gone missing (just count the reported dfs).

    skeptical prior + dodgy likelihood = posterior around zero.
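
    To make the error-term point concrete, here is a toy version in R (simulated data with a simplified nested structure, not their actual data or design): suppose the parental treatment varies between litters while several pups per litter are measured. Treating pups as independent replicates uses the wrong error term for that between-litter factor.

    set.seed(1)
    d <- expand.grid(litter = factor(1:8), pup = 1:4)
    d$parent_odor <- ifelse(as.integer(d$litter) <= 4, "acetophenone", "propanol")
    litter_effect <- rnorm(8, 0, 10)[as.integer(d$litter)]      # variation shared within litters
    d$startle     <- 100 + litter_effect + rnorm(nrow(d), 0, 5) # no true treatment effect

    # Wrong: every pup counted as an independent replicate of the litter-level factor
    summary(aov(startle ~ parent_odor, data = d))
    # Better: litter is the error stratum for the between-litter comparison
    summary(aov(startle ~ parent_odor + Error(litter), data = d))

    The first analysis claims 30 denominator degrees of freedom instead of 6 and ignores the within-litter correlation, which is exactly how anticonservative p-values arise.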

    • Wrong analysis for the design; samples apparently gone missing — the sorts of things that so often get swept under the rug.

      There ought to be guidelines for authors such as:

      “Explain why you used the analysis you did, and why it is appropriate for the sampling method/experimental design and question being studied.”

      “Discuss any missing data, including possible causes for missingness and how the missingness might affect the results.”
