“An Experience with a Registered Replication Project”

Anne Pier Salverda writes:

I came across this blog entry, “An Experience with a Registered Replication Project,” and thought that you would find this interesting.

It’s written by Simone Schnall, a social psychologist who is the first author of an oft-cited Psych Science(!) paper (“Cleanliness reduces the severity of moral judgments”) that a group of researchers from Michigan State failed to replicate as part of a big replication project.

Schnall writes about her experience as the subject of a “failed replication”. I like the tone of her blog entry. She discusses some issues she has with how the replication of her work, and the publication of that replication in a special issue of a journal, were handled. A lot of what she writes is very reasonable.

Schnall believes that her finding did not replicate because there were ceiling effects in a lot of the items in the (direct) replication of her study. So far so good; people can make mistakes in analyzing data from replication studies too. But then she writes the following:

It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.

She may be talking about the peer-review process here, it’s not entirely clear to me from the context. But what am I to think of her own data-analysis skills and strategies if it takes her “considerable time and effort” to discover the ceiling effects in the data? Looking at the distribution of variables is the very first thing that any researcher should do when they start analyzing their data. Step 1: exploratory data analysis! If she’s right about how this aspect of the data accounts for the failure to replicate her original finding, of course that makes for an interesting story—we need to be critical of failed replications too. But the thought that there may be a substantial number of scientists who don’t look in detail at basic aspects of their data before they start working with aggregated data makes me shake my head in disbelief.
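
To make this concrete, here is the kind of quick item-level check that would surface a ceiling effect before any aggregation. It is only a sketch in Python: the column names, scale maximum, and flagging threshold are placeholders, not anything taken from Schnall’s study or the replication data.

# Flag items where responses pile up at the top of the response scale.
# All names, the scale maximum, and the 50% threshold below are hypothetical.
import pandas as pd

def ceiling_report(df: pd.DataFrame, items: list[str], scale_max: float,
                   threshold: float = 0.5) -> pd.DataFrame:
    """For each item, report the share of responses sitting at the scale maximum."""
    rows = []
    for item in items:
        at_ceiling = (df[item] == scale_max).mean()
        rows.append({"item": item,
                     "prop_at_ceiling": at_ceiling,
                     "flagged": at_ceiling >= threshold})
    return pd.DataFrame(rows).sort_values("prop_at_ceiling", ascending=False)

# Usage (hypothetical file and column names):
# data = pd.read_csv("replication_item_level.csv")
# print(ceiling_report(data, ["dilemma_1", "dilemma_2", "dilemma_3"], scale_max=9))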

My reply:

On the details of the ceiling effect, all I can say is that I’ve made a lot of mistakes in data analysis myself, so if it really took Schnall longer than it should’ve to discover this aspect of her data, I wouldn’t be so hard on her. Exploratory analysis is always a good idea but it’s still easy to miss things.

But speaking more generally about the issues of scientific communication, I disagree with Schnall’s implication that authors of published papers should have some special privileges regarding the discussion of their work. What about all the researchers who did studies where they made no dramatic claims, found no statistical significance, and then didn’t get published? Why don’t they get special treatment too? I think a big problem with the current system is that it puts published work on a plateau where it is difficult to dislodge. For example, “That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to.” My response to this is that, in many areas of research, peer reviewers do not seem to be deserving of that trust. Peer review often seems to be a matter of researchers approving other papers in their subfield, and accepting claims of statistical significance despite all the problems of multiple comparisons and implausible effect size estimates.

Schnall quotes Daniel Kahneman who writes that if replicators make no attempts to work with authors of the original work, “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.” I have mixed feelings about this. Maybe Kahneman is correct about replication per se, where there is a well-defined goal of getting the exact same result, but I don’t think it should apply more generally to criticism.

Regarding the specifics of her blog post, she has a lot of discussion about how the replication is different from her study. This is all fine, but the flip side is: why should her original study be considered representative of the general population? Once you accept that effect sizes vary, this calls into question the paradigm of making quite general claims based on a study of a particular sample. Finally, I was disappointed that nowhere in her blog post does she consider the possibility that maybe, just maybe, her findings were spurious. She writes, “I have worked in this area for almost 20 years and am confident that my results are robust.” The problem is that such confidence can be self-reinforcing: once you think you’ve found something, you can keep finding it. The rules of statistical significance give a researcher enough “outs” that he or she can persist in a dead-end research paradigm for a long time. Just to be clear, I’m not saying that’s what’s happening here—I know nothing about Schnall’s work—but it can happen. Consider, for example, the work of Satoshi Kanazawa.

That said, I understand and sympathize with Schnall’s annoyance at some of the criticisms she’s received. I’ve had similar feelings in response to criticisms of my own work: sometimes people point out actual and serious errors, sometimes mistakes of over-generalization, other times ways in which I’ve misconstrued the literature. Just recently my colleagues and I made some changes in this paper because someone pointed out some small ways in which it was misleading. Luckily he told us while the paper was still being revised for publication. Other times I learn about problems later and need to issue a correction. But sometimes I do get misinformed criticisms, even published papers where someone misunderstands my work and slams it. And I don’t like it. So I see where she’s coming from.

Again, from a statistical perspective, my biggest problem with what Schnall writes is that she doesn’t seem to consider, at all, the possibility that her original results could be spurious and arise from some combination of measurement error and capitalization on noise which, looked at a certain way, might include some statistically significant comparisons. I worry that she is thinking in a binary way: her earlier paper was a success, the replication was a failure, so she seeks to explain the differences. There are certainly differences between any two experiments on the same general phenomenon—that’s why we do meta-analysis—but, as Kahneman and Tversky demonstrated four decades ago, researchers tend to systematically underestimate the importance of variation in interpreting experimental results.
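
To illustrate what “capitalization on noise” can look like in practice, here is a small simulation, not based on anyone’s actual data or design: with pure noise and a handful of comparisons per study, the most favorable comparison comes out “statistically significant” far more often than 5% of the time.

# Pure-noise simulation: several comparisons per "study," true effect = 0 everywhere.
# A researcher who reports the most favorable comparison sees p < 0.05 roughly
# 1 - 0.95**5 ≈ 23% of the time, not 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_comparisons, n_per_group = 1000, 5, 40

significant_somewhere = 0
for _ in range(n_studies):
    pvals = []
    for _ in range(n_comparisons):
        a = rng.normal(size=n_per_group)   # "treatment" group: pure noise
        b = rng.normal(size=n_per_group)   # "control" group: pure noise
        pvals.append(stats.ttest_ind(a, b).pvalue)
    if min(pvals) < 0.05:
        significant_somewhere += 1

print(f"Share of null studies with at least one p < 0.05: "
      f"{significant_somewhere / n_studies:.2f}")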

Following up, Salverda writes:

In the meantime, I came across two papers. One is by Schnall (in press, I believe), a response to the paper that failed to replicate her work.

And the people who failed to replicate her work [David Johnson, Felix Cheung, and Brent Donnellan] wrote a paper that addresses her criticism.

31 thoughts on ““An Experience with a Registered Replication Project””

  1. As a nonacademic outsider, I must say reading this blog (among other things) makes me sad and cynical about academic research! My family is tired of hearing about it at dinner. Arrogance like Schnall’s turns my stomach. There is so much wrong with a first-past-the-post attitude toward published results, and the idea that replicators bear any special burden of proof, that it’s jaw-dropping to see them seriously defended.

    And in the bigger picture, Schnall and the Johnson group are quibbling over generalizations from dubious, WEIRD convenience samples. “Moreover, we emphasize that there was no a priori reason to suspect that the SBH dependent variables would be inappropriate for use with college students from Michigan because they had been originally been developed for use with college students from Virginia.” (Johnson et al. 2014)

    • As an academic insider, reading this blog makes me cynical about the majority of social science research and the peer review process. At the same time, however, I am hopeful that many of the statistical, conceptual, and philosophical viewpoints Andrew touches on will gain more attention in the social sciences and be featured more prominently in graduate curricula.

    • Kyle:

      You write of “arrogance like Schnall’s.” I don’t know Schnall but my guess is that the problem is not arrogance so much as a naive overestimation on her part of the evidence provided by statistically significant comparisons. In a sense the fault is with all of us who teach statistics and write statistics textbooks and write research papers in which we have a p<0.05 result and then treat it as true. I was struck by Schnall's discussion in that nowhere did I see her say, "Hey, maybe my result was just a lucky break in the data, not a real pattern in the general population." It would be open-minded for a researcher to consider this possibility. But to not consider it seems more of a sign of misunderstanding of statistics, rather than arrogance.

  2. The stability of a phenomenon is logically separate from post-data statistical manipulations, and can’t be inferred from such manipulations.

    If you really want to make progress, do what physicists do (or used to do) and verify that the phenomenon is stable before theorizing about it.

  3. Here’s a proposal which is guaranteed to end the crisis of reproducibility:

    Every time a researcher invokes a frequency distribution and calls it a “probability distribution”, they should be required to supply EXPERIMENTAL evidence that the phenomenon in question really does have stable frequency distributions under the precisely defined circumstances they claim for it. (A toy version of such a check is sketched at the end of this thread.)

    Problem solved.

    • In addition, every time a researcher invokes a belief distribution and calls it a “probability distribution,” they should be required to really, really believe in the most credible values, under the appropriate set of assumptions.

      • Deal.

        If I assign a N(0,1mm) to an error in a single measurement from a ruler with accurate divisions down to 1mm, then I “really, really, really” will believe the error in my measurement isn’t ~ 1000 km.

      • I’m curious to know what objection anyone could possibly have to my suggestion. Everything depends on those distributions being stable for classical statisticians, and there’s no way to know without experimentally checking it over a wide range of conditions.

        Moreover it can be done. When physicists first started investigating diffraction patterns from crystal lattices, which are frequency distributions to the extent the things passing through the crystal are “particles”, the first thing on the agenda was to verify the patterns were stable and not completely different every time they did the experiment.

        So I’ll hazard a guess as to why classical statisticians with their beloved “guarantees” don’t insist on the one thing that stood a chance of making the guarantees a reality:

        it would reduce the applicability of classical statistics to about a thousandth of what it is today.

        • To be clear, I was joking about the fact that beliefs can’t be objectively verified, not disagreeing with the proposal.

          Note that in your first comment, you say that the frequency distribution should be demonstrated “under the precisely defined circumstances they claim for it,” whereas in your later comment you say that they should be checked “over a wide range of conditions.” These don’t seem entirely consistent with one another.

          In either case, one possible problem is that the (allegedly) stable distribution in question is the distribution of a test statistic under the null hypothesis, the truth of which we don’t actually know.

          I’m pretty confident that, right or wrong, many experimental psychologists would object and argue that it’s too expensive, time-consuming, and/or difficult to replicate most psych experiments enough times to get a frequency distribution capable of providing any kind of compelling evidence of distributional stability.

        • They are completely consistent. Any setup will have parameters which are specified (fixed) and others which aren’t (and thus potentially vary). So the stability has to be checked over those fixed parameters AND various values of the non-fixed ones.

          Your “one possible problem” is doubtless a big problem sometimes for classical statisticians. How are they going to verify P(x|H_0) when H_0 is never true? I’m sure there are examples where this is reasonable, but some where it is not.

          That’s their problem. They are the ones who insist on interpreting probabilities as frequencies because they can’t imagine any other interpretation. That’s the bed they made; we should insist they sleep in it.

          Doubtless it is expensive. If you assert that every sequence of n=1000 occurs equally often, then indeed it’s going to be VERY expensive to prove it.

          Again, that’s their problem. They are the ones who insist on interpreting probabilities as frequencies because they can’t imagine any other interpretation. That’s the bed they made; we should insist they sleep in it.

        • I genuinely don’t see any way around this. If classical statisticians want probabilities to be stable limits of frequency distributions, and they base the correctness of everything they do on this, then we should insist that they experimentally verify such a stable limit exists before ever discussing what shape it is and what that implies.

          Without that the whole exercise is a joke.

          If Frequentists actually put their money where their mouth is on this, it would have the happy consequence that classical statistics is only applicable to a narrow range of problems where frequentism does little damage.

    • Here’s the appropriate response whenever a Classical Statistician says “the sampling distribution of x is P(x)”:

      “I don’t want your opinion or belief as to whether a stable limiting frequency distribution of x exists, I want physical proof that it’s stable and has a limit before discussing its shape.”

      • I’m no fan of subjective Bayes, but when subjective Bayesians have an opinion/belief about some parameter like the speed of light, at least the parameter exists.

        If Classical Statisticians don’t want to physically verify that stable frequency distributions exist and have a limit, then their sampling distribution P(x) is just their opinion/belief about something that likely doesn’t even exist.

        Why should opinions about things that don’t exist be accorded greater scientific respect than opinions about things that do?
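
    A toy version of the check proposed earlier in this thread, for concreteness: collect the same measurement in several independent batches and test whether the batch-level frequency distributions agree (here via a chi-square test of homogeneity on binned data). The simulated data source, batch sizes, and bins are all made up for illustration; this is a sketch, not a recommendation of any particular test.

    # Gather several independent batches of the "same" measurement and test whether
    # their empirical frequency distributions are mutually consistent.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_batches, batch_size = 6, 500
    bins = np.linspace(-3, 3, 13)          # 12 bins covering the bulk of the data

    # Stand-in for repeated runs of the same experiment.
    batches = [rng.normal(size=batch_size) for _ in range(n_batches)]

    # Contingency table: rows = batches, columns = bin counts.
    counts = np.array([np.histogram(b, bins=bins)[0] for b in batches])
    chi2, p, dof, _ = stats.chi2_contingency(counts)

    print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.3f}")
    # A small p-value is evidence that the batch-to-batch frequencies are NOT stable;
    # a large one is (weak) support for stability over these particular conditions.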

  4. In my earlier years when I was still doing basic research in social psychology, I ran into replication issues on a regular basis. Only the most widely studied phenomena ever replicated. When a phenomenon has thousands of papers published on it, it’s generally robust. Those that only have dozens (or fewer) tend not to be. Anecdotally, many of the field’s upper echelon researchers seemed to recognize that most published results were selective at best — it wasn’t always what was reported that was most telling, but rather what logical interim steps were likely taken that weren’t reported. Those omitted steps were often failed replications.

    • Mayo: “Despite the best of intentions of the new replicationists, there are grounds for questioning if the meta-methodology is ready for the heavy burden being placed on it. I’m not saying that facets for the necessary methodology aren’t out there, but that the pieces haven’t been fully assembled ahead of time. Until they are, the basis for scrutinizing failed (and successful) replications will remain in flux.”

      Agree. Scientists don’t agree on what replication is. They are unsure when to call a replication a success or a failure, or whether that is even a legitimate criterion. And then there is robustness, and meta-analysis. A veritable alphabet soup.

      • Fernando: I actually don’t think the main issue is getting clear on the meaning of “replication” or “reproducible” in general, but rather, more specifically, in psychology. I’m not saying these notions are straightforward, even in other fields (witness how Anil Potti denied that Baggerly and Coombes had failed to replicate him because they didn’t duplicate his unreliable technique): http://errorstatistics.com/2014/05/31/what-have-we-learned-from-the-anil-potti-training-and-test-data-fireworks-part-1/
        — I happen to be writing on this right now. I think the nature of the required critical appraisal in psych is fairly clear, but I don’t know if the psych researchers are willing (or even able) to carry it out. Perhaps a certain degree of tunnel vision is required for the field, perhaps it’s enough that they’ve instituted a certain degree of consciousness-raising.

  5. “I worry that she is thinking in a binary way: her earlier paper was a success, the replication was a failure, so she seeks to explain the differences.” That brings it to the point. That it is necessary and useful to widen one’s perspective beyond that binary thinking is one precious thing I’ve learned from this blog.

  6. To address Anne’s comment, “Looking at the distribution of variables is the very first thing that any researcher should do when they start analyzing their data. … the thought that there may be a substantial number of scientists who don’t look in detail at basic aspects of their data before they start working with aggregated data makes me shake my head in disbelief.”

    My experience is that not looking at the distribution of variables before analyzing data is, sadly, very common. I looked at some of the replications in the special replications issue, and they did not give any evidence of looking at the data. But (again, sadly) this did not surprise me because I have seen it so often. (See some of my comments on this at http://www.ma.utexas.edu/blogs/mks/2014/06/26/beyond-the-buzz-part-iii-more-on-model-assumptions/).

  7. It is not clear to me why an original author should have the right to analyze the replication and review the manuscript before publication (this is what Schnall complained about). Once the original paper (and data) is published and enters the scientific dialogue, it is not the ‘property’ of the original author, but belongs to all. In the past, authors have responded to replications of their work with other articles, but after publication.

    Did you see the quite harsh Twitter debate around the case? For example, the editors of the special issue were addressed as “shameless little bullies” and “replication police,” while others strongly supported the replication project, saying that some psychologists “don’t understand how important replication is.” Check out @BrianNosek, @DanTGilbert, @lakens a few weeks back. I copied the tweets onto my blog post about the case:

    http://politicalsciencereplication.wordpress.com/2014/05/25/replication-bullying-who-replicates-the-replicators/

  8. Pingback: “Replication Bullying:” Who replicates the replicators? | Political Science Replication

    • “This case shows that there is still no established culture and procedures to reproduce published work.”

      When I was a young’un, I visited Oak Ridge National Laboratory to get a peek at a direct replication of the infamous cold fusion experiment, which had only recently been made public. It was one of a slew of similar replications being done in labs all over the world at the same time.

      The main discussion with the Experimenter centered on how the experiment labeled “simple” in the press was actually quite subtle and difficult to get right. Considerable care and expertise were needed for the replications.

      No one questioned these efforts. No one thought they were personal attacks of some kind. No one accused the replicators of bullying. No one thought the details were so difficult that they were beyond replication in practice. Everyone instinctively recognized the legitimacy of it without any discussion at all.

      So when they say there’s no culture of replication, what they mean is, there’s no culture of replication outside of the hard sciences.

  9. A fragile effect might still be of interest to those who want to produce the effect, but it’s not of much interest to those who want to know how the system of interest behaves for the most part under ordinary conditions. Many responses to failed replication attempts ignore this point. In fields like moral psychology, an effect that requires very carefully controlling the conditions of the experiment might be of some interest to specialists, but it is not of much interest to the general public, and the specialists need to know about its fragility.

  10. Pingback: Replication controversies - Statistical Modeling, Causal Inference, and Social Science
