Replication controversies

I don’t know what ATR is but I’m glad somebody is on the job of prohibiting replication catastrophe:

[Image: screenshot taken 2014-11-18 at 7.07.28 PM]

Seriously, though, I’m on a list regarding a reproducibility project, and someone forwarded along this blog by psychology researcher Simone Schnall, whose attitudes we discussed several months ago in the context of some controversies about attempted replications of some of her work in social psychology.

I’ll return at the end to my remarks from July, but first I’d like to address Schnall’s recent blog, which I found unsettling. There are some technical issues that I can discuss:

1. Schnall writes: “Although it [a direct replication] can help establish whether a method is reliable, it cannot say much about the existence of a given phenomenon, especially when a repetition is only done once.” I think she misses the point that, if a replication reveals that a method is not reliable (I assume she’s using the word “reliability” in the sense that it’s used in psychological measurement, so that “not reliable” would imply high variance) then it can also reveal that an original study, which at first glance seemed to provide strong evidence in favor of a phenomenon, really doesn’t. The Nosek et al. “50 shades of gray” paper is an excellent example.

2. Her discussion of replication of the Stroop effect also seems to miss the point, or at least so it seems to me. To me, it makes sense to replicate effects that everyone believes, as a sort of “active control” on the whole replication process. Just as it also makes sense to do “passive controls” and try to replicate effects that nobody thinks can occur. Schnall writes that in the choice of topics to replicate, “it is irrelevant if an extensive literature has already confirmed the existence of a phenomenon.” But that doesn’t seem quite right. I assume that the extensive literature on Stroop is one reason it’s been chosen to be included in the study.

The problem, perhaps, is that she seems to see the goal of replication as a goal to shoot things down. From that standpoint, sure, it seems almost iconoclastic to try to replicate (and, by implication, shoot down) Stroop, a bit disrespectful of this line of research. But I don’t see any reason why replication should be taken in that way. Replication can, and should, be a way to confirm a finding. I have no doubt that Stroop will be replicated—I’ve tried the Stroop test myself (before knowing what it was about) and the effect was huge, and others confirm this experience. This is a large effect in the context of small variation. I guess that, with some great effort, it would be possible to design a low-power replication of Stroop (maybe use a monochrome image, embed it in a between-person design, and run it on Mechanical Turk with a tiny sample size?), but I’d think any reasonable replication couldn’t fail to succeed. Indeed, if Stroop weren’t replicated, this would imply a big problem with the replication process (or, at least with that particular experiment). But that’s the point, that’s one reason for doing this sort of active control. The extensive earlier literature is not irrelevant at all!
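To see why the design matters so much here, a rough simulation sketch can help. Every number in it is an assumption chosen for illustration (a 100 ms effect, 150 ms of person-to-person variation, 30 ms of trial noise, 20 participants), not an estimate from the Stroop literature, and the 1.96 cutoff is a crude normal approximation rather than a proper t test. The function names are hypothetical.

```python
import math
import random
import statistics

def within_person_z(n, effect, between_sd, within_sd, rng):
    """Each person is measured in both conditions; analyze the paired differences."""
    diffs = []
    for _ in range(n):
        baseline = rng.gauss(0.0, between_sd)  # person's overall speed
        congruent = baseline + rng.gauss(0.0, within_sd)
        incongruent = baseline + effect + rng.gauss(0.0, within_sd)
        diffs.append(incongruent - congruent)  # baseline cancels out
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.fmean(diffs) / se

def between_person_z(n, effect, between_sd, within_sd, rng):
    """Half the people see each condition; person-level variation stays in the noise."""
    half = n // 2

    def score(shift):
        return rng.gauss(0.0, between_sd) + shift + rng.gauss(0.0, within_sd)

    congruent = [score(0.0) for _ in range(half)]
    incongruent = [score(effect) for _ in range(half)]
    se = math.sqrt(statistics.variance(congruent) / half
                   + statistics.variance(incongruent) / half)
    return (statistics.fmean(incongruent) - statistics.fmean(congruent)) / se

def estimated_power(z_fn, n=20, effect=100.0, between_sd=150.0,
                    within_sd=30.0, n_sims=2000, seed=1):
    """Share of simulated studies whose z statistic exceeds 1.96 (assumed cutoff)."""
    rng = random.Random(seed)
    hits = sum(z_fn(n, effect, between_sd, within_sd, rng) > 1.96
               for _ in range(n_sims))
    return hits / n_sims

if __name__ == "__main__":
    print("within-person power: ", estimated_power(within_person_z))
    print("between-person power:", estimated_power(between_person_z))
```

Under these made-up settings the within-person design detects the effect almost every time, while the between-person design with the same 20 people misses it more often than not. Same effect, same sample size, very different replication prospects.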

3. Also I think her statement, “To establish the absence of an effect is much more difficult than the presence of an effect,” misses the point. The argument is not that certain claimed effects are zero but rather that there is no strong evidence that they represent general aspects of human nature (as is typically claimed in the published articles). If an “elderly words” stimulus makes people walk more slowly one day in one lab, and more quickly another day in another lab, that could be interesting but it’s not the same as the original claim. And, in the meantime, critics are not claiming (or should not be claiming) an absence of any effect but rather they (we) are claiming to see no evidence of a consistent effect.

In her post, Schnall writes, “it is not about determining whether an effect is “real” and exists for all eternity; the evaluation instead answers a simply question: Does a conclusion follow from the evidence in a specific paper?”—so maybe we’re in agreement here. The point of criticism of all sorts (including analysis of replication) can be to address the question, “Does a conclusion follow from the evidence in a specific paper?” Lots of statistical research (as well as compelling examples such as that of Nosek et al.) has demonstrated that simple p-values are not always good summaries of evidence. So we should all be on the same side here: we all agree that effects vary, none of us is trying to demonstrate that an effect exists for all eternity, none of us is trying to establish the absence of an effect. It’s all about the size and consistency of effects, and critics (including me) argue that effects are typically a lot smaller and a lot less consistent than are claimed in papers published by researchers who are devoted to these topics. It’s not that people are “cheating” or “fishing for significance” or whatever, it’s just that there’s statistical evidence that the magnitude and stability of effects are overestimated.

4. Finally, here’s a statement of Schnall that really bothers me: “There is a long tradition in science to withhold judgment on findings until they have survived expert peer review.” Actually, that statement is fine with me. But I’m bothered by what I see as an implied converse, that, once a finding has survived expert peer review, it should be trusted. Ok, don’t get me wrong, Schnall doesn’t say that second part in this most recent post of hers, and if she agrees with me (that is, if she does not think that peer-reviewed publication implies that a study should be trusted), that’s great. But her earlier writings on this topic give me the sense that she believes that published studies, at least in certain fields of psychology, should get the benefit of the doubt: that, once they’ve been published in a peer-reviewed publication, they should stand on a plateau and require some special effort to be dislodged. So when Study 1 says one thing and pre-registered Study 2 says another, she seems to want to give the benefit of the doubt to Study 1. But I don’t see that.

Different fields, different perspectives

A lot of this discussion seems somehow “off” to me. Perhaps this is because I do a lot of work in political science. And almost every claim in political science is contested. That’s the nature of claims about politics. As a result, political scientists do not expect deference to published claims. We have disputes, sometimes studies fail to replicate, and that’s ok. Research psychology is perhaps different in that there’s traditionally been a “we’re all in this together” feeling, and I can see how Schnall and others can be distressed that this traditional collegiality has disappeared. From my perspective, the collegiality could be restored by the simple expedient of researchers such as Schnall recognizing that the patterns they saw in particular datasets might not generalize to larger populations of interest. But I can see how some scholars are so invested in their claims and in their research methods that they don’t want to take that step.

I’m not saying that political science is perfect, but I do think there are some differences in that poli sci has more of a norm of conflict whereas it’s my impression that research psychology has more of the norms of a lab science where repeated experiments are supposed to give identical results. And that’s one of the difficulties.

If scientist B fails to replicate the claims of scientist A who did a low-power study, my first reaction is: hey, no big deal, data are noisy, the patterns in the sample do not generally match the patterns in the population, certainly not if you condition on “p less than .05.” But a psychology researcher trained in this lab tradition might not be looking at sampling variability as an explanation (nowhere in Schnall’s blogs did I see this suggested as a possible source of the differences between original reports and replications) and, as a result, they can perceive a failure to replicate as an attack on the original study, to which the natural response is to attack the replication. But once you become more attuned to sampling and measurement variation, you see that failed replications are to be expected all the time; that’s what it means to do a low-power study.
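Here is a minimal simulation sketch of what conditioning on “p less than .05” does in a low-power setting. The settings are assumptions picked for illustration (a true effect of 0.2 standard deviations, 20 people per group, a z cutoff of 1.96 as a normal approximation); they are not taken from any of the studies discussed above, and the function names are hypothetical.

```python
import math
import random
import statistics

def simulate_study(true_effect, sd, n, rng):
    """Simulate one two-group study; return (effect estimate, z statistic)."""
    treated = [rng.gauss(true_effect, sd) for _ in range(n)]
    control = [rng.gauss(0.0, sd) for _ in range(n)]
    estimate = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treated) / n
                   + statistics.variance(control) / n)
    return estimate, estimate / se

def significance_filter(true_effect=0.2, sd=1.0, n=20, n_sims=5000, seed=1):
    """Run many original studies; follow up each 'significant' one with an
    identical replication. All defaults are illustrative assumptions."""
    rng = random.Random(seed)
    significant_estimates = []
    replicated = 0
    for _ in range(n_sims):
        estimate, z = simulate_study(true_effect, sd, n, rng)
        if z > 1.96:  # the original study "finds" the effect
            significant_estimates.append(estimate)
            _, z_rep = simulate_study(true_effect, sd, n, rng)
            if z_rep > 1.96:
                replicated += 1
    power = len(significant_estimates) / n_sims
    exaggeration = statistics.fmean(significant_estimates) / true_effect
    replication_rate = replicated / len(significant_estimates)
    return power, exaggeration, replication_rate

if __name__ == "__main__":
    power, exaggeration, replication_rate = significance_filter()
    print(f"share of originals reaching significance: {power:.2f}")
    print(f"average significant estimate / true effect: {exaggeration:.1f}")
    print(f"share of those that replicate: {replication_rate:.2f}")
```

In this setup the effect is real, nobody cheats, and the replication is an exact copy of the original. Even so, the estimates that clear the significance threshold overstate the true effect severalfold, and most of them fail to reach significance the second time around.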

Background

OK, that’s the story. But for a bit more perspective it might help to see some of the things I wrote on this several months ago, as the basic issues haven’t changed.

One of the disputes had to do with a ceiling effect in her data that had not been noticed in the original experiment. Here’s what I wrote in July 2014:

On the details of the ceiling effect, all I can say is that I’ve made a lot of mistakes in data analysis myself, so if it really took Schnall longer than it should’ve to discover this aspect of her data, I wouldn’t be so hard on her. Exploratory analysis is always a good idea but it’s still easy to miss things.

But speaking more generally about the issues of scientific communication, I disagree with Schnall’s implication that authors of published papers should have some special privileges regarding the discussion of their work. Think about all the researchers who did studies where they made no dramatic claims, found no statistical significance, and then didn’t get published. Why don’t they get special treatment too? I think a big problem with the current system is that it puts published work on a plateau where it is difficult to dislodge.

Schnall had written, “That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to,” to which I’d remarked:

My response to this is that, in many areas of research, peer reviewers do not seem to be deserving of that trust. Peer review often seems to be a matter of researchers approving other papers in their subfield, and accepting claims of statistical significance despite all the problems of multiple comparisons and implausible effect size estimates.

There’s also this, from my July 2014 post:

Schnall quotes Daniel Kahneman who writes that if replicators make no attempts to work with authors of the original work, “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.” I have mixed feelings about this. Maybe Kahneman is correct about replication per se, where there is a well-defined goal to get the exact but I don’t think it should apply more generally to criticism.

But, thinking more about this, I don’t agree with Kahneman at all. What was I thinking? What was he thinking? “This behavior should be prohibited”??? That’s just wack.

124 Comments

  1. Steve Sailer says:

    Interesting findings in the human sciences are likely to stop replicating after a while. For example, in late 2004 I pointed out that the demographic measure that most closely correlates with the vote by state in the 2000 and 2004 Presidential elections is a measure I invented called Average Years Married for Younger White Women. States where white women in 2002 were most likely to be married between 18 and 44 vote the most Republican. The correlations were astonishingly high in 2000 and 2004.

    That replicated again in 2008 and superbly so in 2012: r = 0.88!

    http://www.vdare.com/articles/happy-white-married-people-vote-republican-so-why-doesnt-the-gop-work-on-making-white-peopl

    But eventually, that correlation is going to stop happening. Some political strategist is going to look at this and figure out a way to get the Democrats to appeal to married voters or the Republicans to appeal to single voters and that will make a hash of my beautiful finding. It’s precisely because it’s a hugely important finding that eventually somebody is going to work very hard to make it stop happening.

    Eventually, a replication study is going to find that my finding isn’t true anymore. If I’m still alive, I’ll be sad, but that’s life. I certainly won’t hold it against the people who did the replication study.

    • Andrew says:

      Steve:

      Indeed, some of the key findings of Red State Blue State did not hold up in 2012. Avi Feller, Boris Shor, and I tell the story here. I have to admit I was unhappy to see the non-replication. But, as I told my coauthors, if someone’s going to discover that the pattern I’d discovered no longer holds, I want that “someone” to be me!

      • Rahul says:

        Regarding Steve’s last comment: Not holding it against people who did a study that failed subsequent replication.

        These are cases of the underlying trend itself changing, not just crappy measurement & analysis. What I would hold against such researchers is not recognizing that what they were studying was a very temporal phenomenon.

        • Steve Sailer says:

          Perhaps it would be diplomatic to offer researchers whose findings no longer replicate an honorable discharge: We’re sure the original findings were true in their time and place, but currently they don’t seem to be universally true.

          • Rahul says:

            The pesky problem is researchers capturing a local, ephemeral trend & selling it as a global, pervasive effect.

            • Andrew says:

              Rahul:

              No, it’s worse than that. It’s researchers picking up on a random pattern in data and selling it as being valid in the population. I have no reason to believe the claims in “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” for example, held even among American voters in that particular year or even the particular days the data were collected.

              • Steve Sailer says:

                Indeed.

                Still, I’m certainly no expert on diplomacy, but offering people who, say, cited a dumb paper some kind of face-saving way to retreat into silence on the subject with their dignity semi-intact might be useful.

                If X is false and Y is true, it would be great if everybody said Y, but it’s still an improvement if people who used to say X just stopped saying anything.

      • Steve Sailer says:

        But that doesn’t mean it wasn’t true for the elections you studied.

        In the human sciences, it’s not just that things change, but there can even be a Heisenbergian aspect: for really interesting questions, a good study may provoke the human actors into changing, indirectly or even directly.

        To make up an example, say you are a sportswriter and a year ago you published an article explaining “Why Peyton Manning and the Denver Broncos Offense Are Unstoppable.” You did a whole lot of work watching NFL game films and building computer models of their play calling. Your article provides the clearest explanation yet of how Denver’s record-setting passing game works. Your article becomes widely read, especially as Denver keeps winning. People start treating you with more respect. You are getting on TV as an expert on Denver’s offense.

        Life is good.

        Until … last winter’s Super Bowl, when Seattle completely shuts down Denver’s offense. You go into the office on Monday morning and suddenly nobody is calling for a quote. You look at Twitter and the only mentions of your name are derisive.

        Months later, you find out that the Seattle coaching staff found your article extremely helpful in planning their defense for the Super Bowl. They went through your article line by line.

        Objectively, that’s about the highest tribute a sportswriter can get for doing analysis. But, readers will probably still make fun of you over Denver being crushed in the Super Bowl.

        Likewise, I wouldn’t be surprised if you could have found a heavily marked up copy of Red State Blue State in the analytics department at the Obama HQ in Chicago. (Probably not at Romney HQ, though!)

        • Andrew says:

          Steve:

          I don’t have a good story for why the pattern changed, but, yes, I do believe we discovered something real, in part because we saw it in 2000, 2004, and 2008, indeed a key part of our book was discussing how that pattern arose in the post-1990 era. So from that perspective there was never a reason to believe it would continue forever, even without any feedback effects of the sort you discussed.

        • Keith O'Rourke says:

          > a good study may provoke the human actors into changing, indirectly or even directly

          That Heisenbergian aspect was what de_Saussure http://en.wikipedia.org/wiki/Ferdinand_de_Saussure gave as his reason for choosing to study linguistics rather than economics.

          Should be better known, as well as the need to periodically redo studies to check if things have changed (which Google analytics sales people apparently do pitch to get clients to do more studies).

          In our communal attempts to get less wrong about the world, we need to keep in mind that the world changes (especially the realm of human interactions).

          • Fernando says:

            Well presumably you can model the change, in which case the model does not change.

            If your model becomes obsolete after a month, that is not a very useful model (unless your horizon is measured in minutes)

            • Steve Sailer says:

              Here’s a general model that works in a whole lot of situations: the things we are most interested in are those that are particularly hard to predict because they match forces that are well-balanced. For example, will Apple’s stock price go up or down tomorrow? Who will win the Super Bowl?

              Then there are things that are easy to predict that strike most people as really boring: When will the sun come up tomorrow? Will Beverly Hills have higher school test scores than Compton? Will men or women run the 100m faster at the next Olympics?

              There seems to be more science involved in the latter kind of predictions, but most people don’t find them very interesting.

              Steven Pinker told me while he was promoting his Blank Slate book:

              “Mental effort seems to be engaged most with the knife edge at which one finds extreme and radically different consequences with each outcome, but the considerations militating towards each one are close to equal.”

      • Steve Sailer says:

        Dear Professor Gelman:

        You should try to get the 2012 Reuters-Ipsos American Mosaic panel of over 40,000 voters because the sample size would be 8 times the survey you used in this paper. A national sample of 5,000 provides an average of only 100 per state across all income groups, so octupling your sample size has a lot of advantages.

        Unfortunately, the 2012 data explorer tool isn’t online anymore, but somebody at Reuters or Ipsos must have the data set somewhere. I did a lot of work with it in early 2013 and it was fascinating. I believe it had household income breakdowns, but I don’t recall for sure.

    • Rahul says:

      Nobody needs to work on it. Sometimes correlations untangle themselves. Note that you are arguing on the basis of an effect seen in just four election years.

      People change. Thinking changes. Polling patterns change.

  2. Stuart Buck says:

    You could write another equally long post dissecting her Edge interview, http://edge.org/conversation/simone_schnall Her analogy to the East German Stasi is a bit hard to follow.

  3. sentinel chicken says:

    To understand the issues at play you have to understand the logic of conceptual replication and its relationship to the idea of constructs. It seems that researchers outside psychology, and those within the field promoting direct replication, don’t actually understand these concepts in the way that folks like Schnall and the replication skeptics do. There is a history, or at least a perceived history, of critics using monomethod as an argument against the validity of a study’s results. The argument is that a finding is simply an artifact of a particular method and doesn’t provide any information about psychology in general. The conceptual replication is designed to cut off this argument and allow one to make general statements about human psychological processes, in general. The ability of the conceptual replication to do this is based on the fact that psychological variables are constructs that can’t be measured directly, by definition. (Note that social psychologists use the Stroop NOT because it replicates, but because of the construct it supposedly represents, which has been variously defined as executive function, conflict monitoring, etc.) So, to say something about human nature, you have to rule out monomethod bias as an alternative explanation. In theory, that’s what conceptual replication is designed to do. In practice, it allows for research degrees of freedom, or a garden of forking paths. And so, this whole debate really has its origins in one question: Is conceptual replication a way to learn about universal aspects of human nature or just a way for researchers to rationalize shoddy methods? The pro-replicators clearly reject conceptual replication while the replication skeptics are convinced of its value.

    There is an interesting irony in this whole schism. In your comments on these issues, you have suggested that researchers need to accept and embrace uncertainty and variability, and work to explain and understand it. However, it seems that the pro-direct replication crowd is actually the group most afraid of variability and uncertainty. They want methods that always work and never vary, like the Stroop. The pro-conceptual replication crowd is much more comfortable with uncertainty, which is why they are comfortable with the method of conceptual replication.

    • sentinel chicken says:

      Also, people get hung up on Bargh’s elderly prime walking speed study, but often overlook there are two other studies in that paper, designed as conceptual replications. I’ve yet to see anyone try to replicate or contest those studies. Anyone can find problems with a single study, but to debunk a finding in psychology, you need to rule out all the evidence for the relationships at play between variables that cannot be measured directly. The best way to do that is with theory, not method.

    • Fernando says:

      Sentinel Chicken:

      There is a history, or at least a perceived history, of critics using monomethod as an argument against the validity of a study’s results. The argument is that a finding is simply an artifact of a particular method and doesn’t provide any information about psychology in general. The conceptual replication is designed to cut off this argument and allow one to make general statements about human psychological processes, in general. The ability of the conceptual replication to do this is based on the fact that psychological variables are constructs that can’t be measured directly, by definition. (Note that social psychologists use the Stroop NOT because it replicates, but because of the construct it supposedly represents, which has been variously defined as executive function, conflict monitoring, etc.) So, to say something about human nature, you have to rule out monomethod bias as an alternative explanation. [my emphasis]

      If you are worried about monomethod bias the solution is not conceptual replication. Rather, the solution is to state a theory of why you think the method in particular may be at fault, and to test for it using another specific method that does not share the same problem.

      Conceptual replication is possibly the worst way to go about "monomethod bias" because it does not posit a theory of anything about the previous study. Put differently, it simply draws a new random sample of researchers, materials, and methods, to see if the finding still holds. For all we know we get a pretty similar sample. Surely it is better to do purposive sampling guided by theory to stress test the specific concern about the particular method.

      For example, if you are worried the functioning of a mercury thermometer at high pressure may have biased experimental results, you don’t repeat the experiment by taking another thermometer at random. You specify why the thermometer might have been wrong in this scenario, and if an electronic thermometer is robust to high pressure then use that to test the prediction. Conceptual replication does not do this. As far as I understand it, it simply states some abstract worry about monomethod and says: let’s have another team repeat the experiment about the effect of concept A on concept B. If the problem is with some tacit knowledge that everyone shares then there is zero variance across samples in this tacit knowledge. If the concern is about tacit knowledge, then theorize what it is, try to make it manifest, and then see if people with this manifest knowledge can replicate the finding relative to people who don’t get any info. Iterate until the problem is well understood.

      In short, scientists test theories. Variation for variation’s sake makes no sense, and is a waste of resources. A fuller discussion of why conceptual replication is possibly the worst form of replication here.

      • sentinel chicken says:

        Fernando: I’m not worried about monomethod bias, “The Field” is. Unfortunately, your comments illustrate my concern that many voices in this discussion/debate fundamentally don’t understand the logic, domain, and practice of conceptual replication. Rerunning a study with a new, randomly selected mercury thermometer is not a conceptual replication. Neither is running it with an electronic thermometer. Why? Conceptual replication is only possible when you’re dealing with concepts, by definition. Temperature is not a concept, it’s a directly observable quality of the physical world. The temperature scenario you describe is the type of concern that one sees in fields where the variables of interest have one manifest form that can be observed directly. Psychology is largely concerned with variables you can’t measure directly and can manifest in a number of different ways. There is no thermometer for self-esteem, anxiety, or mood. Theoretically, these variables can be both state and trait, and can manifest in cognitions, affect and/or behavior that can theoretically overlap and share a lot of variance. I’d suggest reading Jared Diamond’s thoughtful and often overlooked article on this issue to get some perspective: http://www.jareddiamond.org/Jared_Diamond/Further_Reading_files/Diamond%201987_1.pdf

        But to be clear: Conceptual replication is fundamentally a theory-driven practice. It’s not variation for variation’s sake. It is largely driven by the fact that theorizing in psychology is fundamentally fuzzy and the variables being studied are difficult to pin down given the current state of theory and measurement. I agree with the sentiment of your comment: Theory is the key to progress in science. I’ve seen some replication crisis skeptics touch on this lightly, but it seems to be getting lost in the current vitriolic atmosphere. It seems to me that it’s the direct replicators who have really dropped the ball on theory. I have yet to see any of them make any theoretical arguments against the findings they are trying to debunk. That’s a real problem. Anyone can sit in their office and rip apart a study on statistical or methodological grounds. Coming up with a theory about why social priming or embodied cognition doesn’t exist is the scientific work. I’ve yet to see them do it.

        • “Temperature is not a concept, it’s a directly observable quality of the physical world”

          To a psychologist I imagine it looks that way, to a person doing nonequilibrium thermodynamics… not so much.

          Everything directly observable is essentially a “position” measurement. The position of a boundary between mercury and mercury vapor in a thermometer tube, the position (and hence induced voltage) of electrons in a capacitor connected to a thermocouple or a thermistor, the position of a dial on an old analog meter…

          You can’t “directly measure” temperature, in fact temperature is a variety of concepts:

          1) 2/3 of the average kinetic energy of translation of molecules in a system (in units where Boltzmann’s constant is 1)
          2) the rate of change of energy with respect to entropy in an equilibrium system
          3) A Lagrange multiplier in a maximum entropy Bayesian model of the state of a system given a known total energy.
          4) Some other concepts related to quantum mechanics that I’m not familiar with.

          In computational molecular dynamics we have to make choices about these different conceptions of temperature, because they give different results in non-equilibrium systems.

          The point is, whether a mercury vs electronic thermometer is a conceptual replication as you define it is dependent on whether you believe 1,2,3… are all equivalent for your purposes physically. The mechanism by which temperature induces a change in the position of molecules in a mercury thermometer is potentially very different from the one by which temperature induces movement of electrons inside semiconductors.

          Another way of saying that is: everything is a model, but some models are more precise and broadly applicable than others.

        • Fernando says:

          Sentinel:

          Conceptual replication is fundamentally a theory-driven practice. It’s not variation for variation sake. It is largely driven by the fact that theorizing in psychology is fundamentally fuzzy and the variables being studied are difficult to pin down given the current state of theory and measurement.

          As Daniel has pointed out, pretty much everything in life is a concept in so far as, like Plato, we live in a cave and are only making inferences about the unobservable world using the shadows in the cave. In political science we work with concepts like "democracy" and the concept and the theory are mutually dependent.

          No matter. My reading of how psychologists talk about conceptual replication is mostly as taking a random sample of people, materials, and methods. Maybe to some people, like yourself, it is a theory-driven exercise, which is great. Ultimately, this simply highlights the semantic mess replication is in, with over 18 replication typologies, spanning 79 replication types.

          I have yet to see any of them make any theoretical arguments against the findings they are trying to debunk. That’s a real problem. Anyone can sit in their office and rip apart a study on statistical or methodological grounds.

          When I do replications I am never motivated by the goal to debunk the original work. I do them because I want to understand how they got a result I think is unlikely. The problem is that journals do not publish replications, and when they do they want a debunking.

          I have argued against this on the grounds that it simply invites specification searches. Instead, I propose procedural replications that simply try to teach us something about scientific practice, with a view to improving it, and with it the reliability of research findings. As a by-product you may get a "debunking" but that is never the goal.

        • Ann on a mouse says:

          Taking a study apart on statistical or methodological grounds *is* coming up with a theory about why the effect it purports to demonstrate doesn’t exist.

    • Ken Schulz says:

      I certainly don’t understand conceptual replication and direct replication as mutually exclusive, as you seem to argue; they are solutions to different problems. As you say, (successful) conceptual replication gives us confidence to generalize a phenomenon beyond the specific conditions under which it was initially observed. Direct replication establishes the robustness of the effect under conditions which match the initial case closely. Really, since exact duplication of the conditions of the original finding is never possible (you can’t step into the same river twice…), ‘direct’ is just a degenerate or trivial case of conceptual replication; one in which the differences in conditions are considered insignificant by general agreement.

  4. L Hamilton says:

    I take a different social science perspective in that replication across diverse datasets (surveys) and methods plays a central role in deciding to what extent previous, exploratory findings can generalize. But then you hit another problem because if the answer is affirmative then reviewers can complain that’s nothing new. I’ve had one paper rejected on these grounds because it used 29 datasets to replicate something I published a few years earlier using 2 (and felt that was out on a limb). Results were in agreement and now much more precise, which I thought noteworthy but the editor yawned.

    The “replication” in controversy here is a more exciting kind, apparently.

  5. Fernando says:

    Andrew: I assume she’s using the word "reliability" in the sense that it’s used in psychological measurement, so that "not reliable" would imply high variance
    Yes, but there can be reliable unreliability. That is, processes, methods, etc. that reliably generate high variance. If you dive down a few more feet on this you get to Frank Knight’s distinction between risk and uncertainty.
    "Replication can, and should, be a way to confirm a finding"
    Not only that, it should be used to study scientific practice, or how scientists go about practicing science, with a view to improving the reliability of the scientific method. This is important because, as we have discussed before, the substantive contributions of any study are contingent on the methodological assumptions. This is why ESP experiments inform us about the poor quality of the studies involved and not ESP.
    _ The point of criticism of all sorts (including analysis of replication) can be to address the question, "Does a conclusion follow from the evidence in a specific paper?"_
    Agreed, but the problem is most papers do not report crucial information needed in answering such a question. Personally, my (extreme) position is that if it was not pre-registered then I don’t think the conclusion follows. Especially conclusions of the sort common in economics: "We have theory T, and we find strong support for it in the data", when perhaps the data were tortured into confession. I would buy: "We have theory T, and we have found a particular configuration of these data that are consistent with the theory". The former says the theory is probable, the latter that it is possible. Very different.
    _ expert peer review _
    When I buy a car I don’t have an expert assess it. I trust that the manufacturer has followed ISO 9000 production standards and is willing to offer a money back guarantee. I’d like to see fewer experts and more process.
    it’s my impression that research psychology has more of the norms of a lab science where repeated experiments are supposed to give identical results
    The problem with this attitude is that it is unscientific, starting with the binary notion of failed vs successful replication, which makes no sense. Unfortunately, the problem we have is most of stats is focused on the one-shot study, and meta-analysis is not up to the task. My sense is most people are not taught to think in terms of sequences of studies, and the implications thereof.

  6. question says:

    “Schnall, Haidt, Clore & Jordan (2008) showed across 4 experiments using different methods that feelings of disgust can make moral judgments more severe. According to Google Scholar, since 2008 this paper has already been cited over 550 times.”
    Disgust as Embodied Moral Judgment. Schnall, Haidt, Clore & Jordan (2008)
    http://www.dspace.cam.ac.uk/bitstream/1810/239313/1/Schnall%2c%20Haidt%2c%20Clore%20%26%20Jordan%20%282008%29.pdf

    This is bizarre. Let’s look at one of their results.

    Overall, they want to answer the question: “Is Physical Purity Related to Moral Purity?”
    They do not attempt to assess those qualities directly but instead: “Is physical disgust related to moral outcomes?”
    The method to elicit disgust: “we exposed some participants to a disgusting smell— a commercially available “fart spray”— while they made moral judgments.”
    To assess “moral outcome” one question they asked people was: “How moral or immoral do you, personally, find consensual sex between first cousins to be? (1 = extremely immoral, 7 = perfectly okay)”

    So they sprayed either 0, 4, or 8 sprays of fart spray (n=40 each) into a garbage bag and asked people to answer that question while sitting in a room containing the smelly bag. They report only the mean/sd for each group:
    0 sprays: 2.67 +/- 1.53
    4 sprays: 1.9 +/- 0.93
    8 sprays: 2.4 +/- 1.69

    From this (along with some other results from other questions) they conclude: “mild-stink participants and strong-stink participants were both more severe in their average moral judgments than were control participants. The mild-stink and strong-stink participants did not differ.”

    I also note that they directly measured level of disgust:
    “Post hoc tests revealed that strong-stink participants were significantly more bothered by an unpleasant odor during the moral judgments (M = 3.13, SD = 0.20) than were mild-stink participants (M = 2.13, SD = 0.20), who in turn were significantly more bothered by an odor than were control participants (M = 1.10, SD = 0.20).”

    If I were a reviewer I would demand answers to these two questions:
    1) Why did 4 sprays of fart smell make people think first cousin marriage was immoral while 8 sprays did not?
    2) Where is the plot of disgust vs moral outcome?

    I wonder, do any of the 550 citing papers note these oddities?
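
The group summaries quoted above are enough for a rough check of the dose-response pattern. A minimal sketch, assuming n = 40 per group as stated in this comment and using Welch's t statistic computed from the means and SDs alone; the helper names `welch_t` and `compare` are made up for illustration, and this is not the analysis the paper's authors ran.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary stats."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean a
    var_b = sd_b ** 2 / n_b  # squared standard error of mean b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# (mean, sd, n) per number of sprays, as quoted in the comment; n = 40 is the commenter's figure
groups = {0: (2.67, 1.53, 40), 4: (1.90, 0.93, 40), 8: (2.40, 1.69, 40)}

def compare(a, b):
    """Welch comparison of the group with `a` sprays against the group with `b` sprays."""
    return welch_t(*groups[a], *groups[b])

for a, b in [(0, 4), (0, 8), (4, 8)]:
    t, df = compare(a, b)
    print(f"{a} vs {b} sprays: t = {t:.2f}, df = {df:.1f}")
```

Under these assumptions the 0 vs 4 contrast gives the largest statistic and the 0 vs 8 contrast the smallest, which is the non-monotone pattern that question 1 asks about.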

    • Andrew says:

      Wow—fart spray! That’s a study I can relate to, it really deserves a blog post all on its own.

    • Chris Crandall says:

      You’ve chosen one of the five DVs for this experiment. If you add all five together (which is entirely reasonable and creates a more reliable measure of moral disapprobation), the means are:

      Control (0): 3.46
      Mild (4): 3.05
      Severe (8): 3.08

      WRT your question 1, 4 sprays = 8 sprays, and ~= 0 sprays. Not a surprise there.

      Anyone can dig around in a study and find a modest anomaly in a paper. They should be frequent. But one shouldn’t spend too much time trying to make hay with them. Certainly not 365 words worth.

      • question says:

        Chris,

        All anomalies need to be mentioned and discussed by the authors of the paper. Omitting that part of the paper is dishonest and really is creating an obstacle to figuring out what is going on. If anything, the anomalies are the most important part of the paper. Why would someone include a dose-response but then fail to mention it when there is no apparent dose-response? Some explanation for this needs to be present, whether it is convincing to the reader or not is up to them.

        Second, here is the best I can do regarding the plot missing from the paper. It should be of each individual data point but we will make do with the averages of averages:
        http://s14.postimg.org/ttt130oz5/moral.png

        “Post hoc tests revealed that strong-stink participants reported feeling significantly more disgusted (M = 2.38, SD = 1.31) than did mild-stink (M = 1.50, SD = 0.78) or control (M = 1.33, SD = 0.66) participants. Contrary to prediction, mild-stink participants’ disgust levels did not differ significantly from those of control participants.”

        If there is a ceiling effect as you suggest (“4 sprays = 8 sprays”), and the difference is due to amount of disgust, then why is it claimed that disgust levels *do not* differ between mild-stink and control?

        From that data (made obvious in my plot) the entire effect seems to occur between 1.33-1.5 level of disgust (on a scale of 1-7). I don’t find it plausible that people will think that sex amongst cousins is “more immoral” because someone in the room farted. Especially since it apparently didn’t even smell that bad.

        • Chris Crandall says:

          There should be “anomalies” all the time. I’m not going to contest your representation. But minor perturbations do not need discussion *every time,* because they are most often due to random variation that one should expect from sampling.

          If we had to discuss everything that deviated from a model, we’d all be tedious (as authors) or bored (as readers). One must assume a certain amount of sophistication for one’s readers. Otherwise we’d all be reading about seeing Jane run.

          • Andrew says:

            Chris:

            Sophistication of readers is a good idea, but as long as Psychological Science is going to publish papers like, “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle,” we’re going to need some serious external peer review. In this case the authors and editors of the journal show an unfortunate lack of sophistication, and it is up to the readers to straighten out the record.

          • question says:

            Chris,

            “minor perturbations do not need discussion *every time,* because they are most often due to random variation that one should expect from sampling.”

            It seems that under your approach people can just pick and choose which anomalies they mention. So what they will do is: when the anomaly supports their favored theory they mention it; when it doesn’t, it is chalked up to “random variation” which is too boring/tedious to talk about. If the theories can only predict higher/lower or no/some relationship (are consistent with 50-99.999% of the possible outcomes), I do not see how any progress could be made at all. People will just measure their opinions.

            • “People will just measure their opinions.”

              It seems to me that to a great extent statistical inference is used for exactly that purpose. If a researcher is studying a question in, e.g., IQ studies, psychology or psycholinguistics or linguistics (this is especially bad in linguistics), how come he/she only finds evidence that supports his/her particular position? How come he/she never once finds a result, an accidental result, that goes against their theory?

              Related to the other thread in this discussion, about whether anomalies in a result should be discussed, many journals simply reject a paper that cannot explain the patterns in a neat manner. This encourages the author, desperate for a publication, to hide or ignore important details. I’m still trying to figure out how to publish a paper with a result that suggests a range of possible explanations, some more plausible than another. A probability distribution of possible theoretical accounts, if you like. So far, only limited success in open access journals. Elsevier-controlled journals want clear, crisp conclusions. This desire for unequivocal results leads to a lot of distortion. Somehow I doubt I’m the only one out there with ambiguous results, experiment after experiment. By the way, sometimes I can’t even replicate my own experiments.

              One solution I like is to collaborate with your scientific opponent. That’d straighten up both parties.

              • question says:

                “Related to the other thread in this discussion, about whether anomalies in a result should be discussed, many journals simply reject a paper that cannot explain the patterns in a neat manner. This encourages the author, desperate for a publication, to hide or ignore important details… Elsevier-controlled journals want clear, crisp conclusions. This desire for unequivocal results leads to a lot of distortion. Somehow I doubt I’m the only one out there with ambiguous results, experiment after experiment.”

                Regardless of where the blame lies, I predict the eventual solution decided upon by many is to throw out decades of literature that primarily resulted from such practices (hiding/ignoring important details). There may be some good stuff, but the amount of dis/mis-information is so large, it is easier to just throw it all away and start over than to sift through it. Of course each person would save a few “valuables”. There could be a tv show about it like “Scientist Hoarding”.

                “I’m still trying to figure out how to publish a paper with a result that suggests a range of possible explanations, some more plausible than another. A probability distribution of possible theoretical accounts, if you like.”

                Is it really that difficult to say “Model/explanation A has these merits/drawbacks”, “Model/explanation B has these…”, etc?

    • Student says:

      In fact, I think that just summing up the item responses is not a good idea and has been a bad practice in the area of psychology.

      • question says:

        I also agree with this. For example what meaning does the average have of these two things:

        “How moral or immoral do you, personally, find consensual sex between first cousins to be? (1 = extremely immoral, 7 = perfectly okay)”

        “Controversy has erupted over a documentary film about Mexican immigrants. The film has received excellent reviews, but several of the people interviewed in it have objected that their rights were violated. The filmmaker deliberately had his camera crew stand back 15 feet in a crowd so that some interviewees did not realize they were being filmed. Because the camera was not hidden, the procedure was legal. What do you think about the studio’s decision to release this film, despite the aforementioned allegations? (1 = strongly disapprove of film release, 7 = strongly approve of film release)”

        The average of the “immorality of cousin sex=3” and “disapproval of a movie=4” is 3.5. What does that mean? What units does this number have?

        • The fact that psych measures often have no units is, at a fundamental level, one of the *reasons* that psych studies can’t replicate reliably.

          Since they are unitless, but they describe a physical outcome, they must be functions of dimensionless ratios of quantities that have dimensions. These would typically be something along the lines of “ratio of the concentration of neurotransmitters in a particular set of synapses divided by the time averaged mean concentration of neurotransmitters in the period prior to the start of the experiment”, and the “functions” involved could be highly nonlinear, have horizontal asymptotes, etc. and vary considerably from person to person based on the arrangement of synapses and so forth.

          The literature on “improper linear models”, in which outcomes are scaled by their observed variation and added up with coefficients equal to 1 or -1 depending on the prior knowledge of the direction of the effect, is the only real justification for such averaging of responses on a 1-7 scale or whatever. However, there is much to be said for such literature. If you’re looking for underlying consistency across many questions, you can find it… but you can’t really believe much about any individual measurement, nor can you interpret the size of the effect across studies in any way.
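
For concreteness, the unit-weighted scoring described above can be sketched in a few lines. This is only an illustration: the function name `unit_weighted_composite` and the response values are made up here, not taken from the paper or from the improper-linear-models literature.

```python
from statistics import mean, stdev

def unit_weighted_composite(items, signs):
    """Standardize each item across respondents, then add with weights +1 or -1.

    items: one list of responses per item (same respondents, same order)
    signs: +1 or -1 per item, from prior knowledge of the expected direction
    """
    n = len(items[0])
    composite = [0.0] * n
    for column, sign in zip(items, signs):
        m, s = mean(column), stdev(column)  # scale by the observed variation
        for i, x in enumerate(column):
            composite[i] += sign * (x - m) / s
    return composite

# Hypothetical 1-7 responses from five respondents on two items (not data from the paper)
cousins = [1, 2, 2, 4, 6]
film = [3, 4, 5, 5, 7]
scores = unit_weighted_composite([cousins, film], signs=[+1, +1])
print([round(s, 2) for s in scores])
```

The composite is in standard-deviation units of this particular sample, which is one way to see why its size is hard to compare across studies.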

          • question says:

            “you can’t really believe much about any individual measurement, nor can you interpret the size of the effect across studies in any way.”

            I am skeptical that we can learn much from these types of measurements. They are averages of snapshots of dynamic systems. I think a superior approach would be to study how individual responses vary with time and environment. Then we could examine simpler systems for similar patterns and use that to guess what kind of process is occurring. From this we could make predictions that, while not believed to be literally true, could be judged as more or less consistent with the data relative to other possibilities.

  7. Rene Bekkers says:

    One kid to the other in the sand box: “I am a true Sand Master. Yesterday I built this huge castle with towers, three feet tall.” Other kid: “Wow, how did you do that? What tools and sand did you use?” “Secret of the Sand Master.”

  8. artkqtarks says:

    “Replication” in the ATR paper refers to the process of copying DNA. That’s a basic biological process that needs to be controlled in order for cells to divide normally.

    I have a question that is a little off-topic. (That is, it is not about replication per se, but is related to use of statistics in biomedical sciences.) What do you think about the controversy discussed in the following articles:
    http://www.the-scientist.com/?articles.view/articleNo/41239/title/Epigenetics-Paper-Raises-Questions/
    http://blogs.discovermagazine.com/neuroskeptic/2014/10/16/inherited-too-good-to-be-true/

    Just to summarize, a pair of researchers published a paper last year in which they conditioned male mice to fear certain odors and found that their offspring were born with the fear of (or at least sensitivity to) the same odors. Recently a commentary was published critiquing the original study. The basic criticism seems to be that even if the phenomenon is real, it is highly unlikely to get the kind of results that were published in the original paper.

    My personal opinion is that the data in the original paper seem very noisy, the effect seems weak even if it exists, and it is highly unlikely that there is a mechanism that makes the phenomenon possible. Essentially I don’t believe the original paper. And I think I understand the gist of the critique. But I’m not sure if I’m entirely comfortable with the critique, either.

    Here is the link to the original paper:
    http://www.nature.com/neuro/journal/v17/n1/full/nn.3594.html

    Here is the critique:
    http://www.genetics.org/content/198/2/449.abstract

    • question says:

      “OPS of adult offspring. Mice were habituated to the startle chambers for 5–10 min on three separate days. On the day of testing, mice were first exposed to 15 startle alone (105-dB noise burst) trials (leaders), before being presented with ten odor + startle trials randomly intermingled with ten startle-alone trials. The odor + startle trials consisted of a 10-s odor presentation co-terminating with a 50-ms, 105-dB noise burst. For each mouse, an OPS score was computed by subtracting the startle response in the first odor + startle trial from the startle response in the last startle-alone leader. This OPS score was then divided by the last startle-alone leader and multiplied by 100 to yield the percent OPS score (% OPS) reported in the results. Mice were exposed to the acetophenone-potentiated startle (acetophenone + startle) and propanol-potentiated startle (propanol + startle) procedures on independent days in a counter-balanced fashion.”

      I just looked at figure 1. It sounds like something like the following occurred: “F1-propanol” mice were exposed to propanol on day 1 and “F1-acetophenone” mice were exposed to acetophenone on day 1. Then both were exposed to the other substance on day 2. The way they calculated the score normalizes to “the last startle-alone leader”. So if the mice were less startled without any treatment (at the end of the “leader” trials) on the second day it would appear to be a treatment effect but actually they just measured mice getting used to the environment.

      They probably tried out different comparisons before deciding on only using the “first odor + startle trial”. Why would they do 10 of them then only use the first? The sample sizes also appear to differ (i.e., different numbers of points in figures 1c and 1d) for no explained reason. Another confound is that the mice are coming from different cohorts (different parents) which could cause them to be different in any number of ways. From their description it seems possible that the animals were housed according to treatment group; whatever goes on in those cages (dominance, etc.) could also make them more or less susceptible to startling.

      If I were to look at it closer I would try to figure out whether the histology came from the same animals as the behaviour. If not, why not? If so, and they want to claim two things are related, why did they not plot one vs the other?

  9. Erin Jonaitis says:

    I definitely would not say that research psychology in general is collegial. It may well vary by subfield.

  10. jrc says:

    “We know how easy it is for any study to fail. There is almost no limit to the reasons for a given experiment to fail and sometimes you figure out what the problem was, you made an error, there was something that you didn’t anticipate. Sometimes you don’t figure it out. There are always many reasons for a study to go wrong and everything would have to go right to get the effect.”

    As an admitted outsider who probably is “guilty” of all of the things Dr. Schnall accuses pro-replication people of (not believing people just because they published something, putting more weight on multiple replications than the original study, lack of belief that the peer-review process mostly generates “truth” – oh, and of telling people in other disciplines what is good for them, obviously) let me offer an alternative statement of this:

    “We know how hard it is for us to get statistically significant results when we do a study. Often times, when we don’t, we try to come up with some reason why it might have failed to produce statistically significant results. Then we tweak the study and try it again. So we conduct studies over and over, constantly adapting our experimental methods until we do find that statistically significant result”.

    And then, as an amateur statistical epistemologist (and professional quantitative researcher) I might point out: suppose a null hypothesis that none of the “treatments” you apply have any real effect. How is the method described here not likely to produce statistically significant results that are totally spurious, at least 1 out of 20 times (and probably more)? And if you only publish results from 1/20 studies that you run (with the big assumption that your rejection rates on placebo treatments are really .05)….
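
The "tweak and rerun until it works" process described above is easy to simulate. A minimal sketch, assuming each rerun is an independent two-group experiment with no real effect, a two-sample z-test with known SD of 1, 10 subjects per group, and up to 20 tries per research program; all of these settings and function names are made up for illustration.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(group_a, group_b):
    """Two-sample z-test p-value, assuming equal group sizes and a known SD of 1."""
    n = len(group_a)
    z = (mean(group_a) - mean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def program_finds_effect(rng, max_tries=20, n=10, alpha=0.05):
    """Rerun a null experiment up to max_tries times; stop at the first p < alpha."""
    for _ in range(max_tries):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]  # same distribution: no real effect
        if two_sided_p(a, b) < alpha:
            return True
    return False

rng = random.Random(0)  # fixed seed so the run is repeatable
programs = 2000
hits = sum(program_finds_effect(rng) for _ in range(programs))
share = hits / programs
print(f"share of null research programs ending in p < 0.05: {share:.2f}")
```

In this setup the share of programs that end with a "significant" result is far above one in twenty, even though every individual test holds its nominal 5% rate.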

    One of the things I have been hoping to see is an estimate of the fraction of all social psychology experiments that actually produce results that get published. I mean – just tell me how many undergraduates were run through a lab, and the sum total N from all of the experiments published from the lab.

    Also – this one is pure gold – on the new post-Stapel mindset she finds worrisome:

    “We need to look for the fraudsters; we need to look for the false positives.”

    The idea that any serious quantitative researcher would equate the term “false-positive” with “fraud” is particularly aggravating to me. If you publish twenty papers with p-values of .05, one of them is likely to be a false-positive. That does not make you a fraud, it actually makes you a very careful researcher who is producing realistic p-values. In that sense, one hallmark of a researcher who has made a big contribution to their field is the fact that they have published false-positive results.

    • Noah Motion says:

      If you publish twenty papers with p-values of .05, one of them is likely to be a false-positive.

      If you run twenty tests with alpha = 0.05 and the null hypothesis is true, you’re (reasonably) likely to get a false positive. It doesn’t follow from this that one of every twenty (published) tests with p < 0.05 is a false positive.

      • Right, more likely a much higher percentage of published papers are false positives! ;-)

        For example, suppose you run N experiments where a simple null hypothesis is true in each one, and then you publish papers on the N/20 of them that have significant findings… 100% of published papers in this literature are false.

        The question we have to ask ourselves is what fraction of the literature has this character? My suspicion is that it’s far too high for my comfort level (to put it diplomatically).
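
One way to put numbers on this: the share of significant results that are false positives depends on how often the tested effects are real and on power, not just on alpha. A rough sketch, where `prior_true` (fraction of tested hypotheses with a real effect) and `power` are assumed inputs chosen for illustration, and only significant results are taken to be published.

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Share of significant results that are false positives."""
    false_pos = alpha * (1 - prior_true)  # nulls that reach significance
    true_pos = power * prior_true         # real effects that reach significance
    return false_pos / (false_pos + true_pos)

for prior_true in (0.0, 0.1, 0.5):
    share = false_positive_share(prior_true, power=0.8)
    print(f"prior_true = {prior_true:.1f}: false-positive share = {share:.3f}")
```

With `prior_true = 0` this reproduces the 100% case above; with half of the tested effects real and 80% power the share drops to about 6%.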

        • This is not a realistic possibility for the p-value enthusiast. Nobody would realistically run 20 experiments and then publish the one experiment that was significant. I find it hard to believe that anyone would do that; it’s not even logistically (financially) possible to do that. Much more likely is that a null result is quietly put away—that’s what I am forced to do. Some people do publish null results enthusiastically, but they do this to conclude an absence of an effect.

          • Martha says:

            Shraven,
            Yes, it is unrealistic that someone will run 20 experiments. But it is common for a researcher to do 20 or more hypothesis tests on one data set, and maybe even report all, but highlight those that are “significant,” without accounting for multiple testing. This is why preregistration (and also accounting for multiple testing) are important.
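
To make the multiple-testing point concrete, here is a small sketch of the family-wise error rate for 20 tests on one data set. It assumes the tests are independent and every null is true, which real analyses of a single data set will only approximate; the function name is made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

n_tests = 20
uncorrected = familywise_error_rate(0.05, n_tests)
bonferroni_alpha = 0.05 / n_tests  # per-test threshold under a Bonferroni correction
corrected = familywise_error_rate(bonferroni_alpha, n_tests)
print(f"uncorrected: {uncorrected:.3f}, Bonferroni-corrected: {corrected:.3f}")
```

Under these assumptions the uncorrected chance of at least one "significant" result is about 0.64, and the Bonferroni threshold of 0.0025 per test brings it back just under 0.05.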

            • Not to mention just “one” experiment where you sequence 30,000,000 RNA molecules in a sample, which come from about 20,000 possible genes which each on average have 1-5 alternative splicings… and compare to a control…

              You might have 1 biological sample in experimental condition, one in control, 20,000 genes, 80,000 alternative splicings, and 30,000,000 reads. How many experiments is that?

              The interpretation of the p value depends strongly on whether you pre-registered say 10 genes you planned to look at or not. Fortunately for the biology community variations on this kind of data are now available on big archive sites, so you don’t even have to get your gloves wet.

            • I doubt that pre-registration and accounting for multiple testing will be adopted as a standard practice. At least in psych-type of disciplines, a typical analysis involves a lot of statistical tests before one settles on the one(s) to report. Would the researcher do only the pre-registered statistical tests? Who will know if they go beyond those? I’m not yet clear on how pre-registering is going to work out for me, but I want to try it.

              Related: are there any journals out there in psych where I can submit a pre-registered study and have it accepted on the condition that I do only the analyses I said I’d do?

          • Fortunately researcher degrees of freedom have substantially reduced the cost of doing significant research.

            • Let’s do a Bayesian calculation and base it on the Begley and Ellis 2012 Nature results (53 replication attempts in cancer research, only 6 were replicable at Amgen. Let’s just pretend that the replication failure means that the result is actually not a generally applicable true fact… there’s problems with this model, but it is an explicit place to start).

              Now misusing p notation for both Bayesian and frequency purposes, and implicitly conditioning throughout on the fact that the research is published in a top journal…

              p(p < 0.05 | null hypothesis is true) = p(null hypothesis is true | p < 0.05) p(p < 0.05)/p(null hypothesis is true).

              On the left we have a question about "how likely does p < 0.05 indicate a false effect", on the right, the first term is evidently around p(null hypothesis is true | p < 0.05) ~ 47/53 ~ 0.887 in cancer research, based on Begley and Ellis at face value, and p(p<0.05) given that it's published is very close to 1 (given published is implicit all around, I leave it out to save typing).

              since the left hand side can't logically be greater than 1, we conclude that p(null hypothesis is true | published) = O(1) in the denominator on the right, and that p values are essentially meaningless (the left hand side is O(1) not O(0.05))
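
Spelled out, the bound used above goes as follows. This is just a restatement under the same face-value assumptions (47/53 as the estimate and p(p < 0.05) close to 1, everything conditional on publication); the symbols S and H_0 are shorthand introduced here.

```latex
\begin{aligned}
P(S \mid H_0) &= \frac{P(H_0 \mid S)\,P(S)}{P(H_0)}, \qquad S = \{p < 0.05\},\\
P(H_0 \mid S) &\approx \tfrac{47}{53} \approx 0.887, \qquad P(S) \approx 1,\\
P(S \mid H_0) \le 1 &\;\Longrightarrow\; P(H_0) \ge P(H_0 \mid S)\,P(S) \approx 0.887,\\
P(H_0) \le 1 &\;\Longrightarrow\; P(S \mid H_0) \ge P(H_0 \mid S)\,P(S) \approx 0.887 .
\end{aligned}
```

So under this back-of-the-envelope model, a published result with a true null reaches p < 0.05 with probability of at least roughly 0.89 rather than 0.05.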

              Your field may vary, you may have covariates such as reputation or experience with a researcher which allow you to screen out the junk, but if you just use a random number generator to select cancer papers…. at first glance…. it doesn't look good.

              • Noah Motion says:

                My point above was that Pr( false positive | p < 0.05) != Pr( false positive | H0 is true).

                As for the calculation you did, I don't think it's true that "p(p<0.05) given that it's published is very close to 1". Many, many papers report lots of p values, any number of which may be greater than 0.05 (or whatever threshold is used). It's probably true that there is at least one p<0.05 in any given published paper (assuming they're reporting p values at all), but that's very different.

              • Noah: presumably the p values < 0.05 are the ones the researchers are using to argue that the associated effect is non-zero, so a table of 5 p values with 1 of them < 0.05 will typically be used to argue for that specific effect. Since the question is really "are the effects that are claimed by the researchers to be real, actually real" I think my calculation is still indicative of something important (though I am perfectly happy agreeing that it's a very back-of-the-envelope model).

                In massive statistical screening type experiments there is usually some effort to figure out "false discovery rate" and/or do corrections to p values (bonferroni or otherwise).
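
For readers who have not seen the two kinds of correction side by side, here is a minimal sketch of the Benjamini-Hochberg step-up procedure (false discovery rate) next to a plain Bonferroni cut. The p-values are invented for illustration, and real screening pipelines typically use library implementations rather than hand-rolled ones.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject flag per p-value, controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p up
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that passes its threshold
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject

def bonferroni(p_values, alpha=0.05):
    """Return a reject flag per p-value using the threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hypothetical p-values from a ten-test screen
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(sum(benjamini_hochberg(p_values)), sum(bonferroni(p_values)))
```

On this made-up list Benjamini-Hochberg keeps the two smallest p-values and Bonferroni keeps only the smallest, while an uncorrected 0.05 cut would have kept five.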

        • question says:

          I think this problem is a red-herring in most use-cases. When the null hypothesis is that two groups of people/animals are exactly the same it IS false. The real issue is finding the simplest and most plausible explanation for the type/amount of difference.

          See for example this question:
          http://andrewgelman.com/2014/11/19/24265/#comment-199338

          Regardless of any other problems, it is pretty much impossible to interpret those results because we have no info on how much to expect mice from two different sets of parents to differ from one another.

          • My point is basically that p values don’t have the practical meaning attributed to them, your point is that p values aren’t even theoretically well grounded. I agree with both. You could make a theoretical improvement by defining a range of values with “practical equivalence to zero”. For example if in people with acne having say 100 pimples on average your drug changes the number of pimples by on average say 10 or fewer with relatively small variability, I would call it not practically helpful even though there is a clear effect.
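
The "practical equivalence to zero" idea can be sketched as a comparison between an interval estimate and a band around zero. The margin of 10 pimples comes from the example above; the intervals and the function name `classify_effect` are hypothetical.

```python
def classify_effect(ci_low, ci_high, margin):
    """Compare a confidence interval for the effect with a band of +/- margin around zero."""
    if -margin <= ci_low and ci_high <= margin:
        return "practically equivalent to zero"
    if ci_low > margin or ci_high < -margin:
        return "practically meaningful"
    return "undecided"

# Hypothetical intervals for the average change in pimple count
print(classify_effect(-8.0, -3.0, margin=10))    # excludes zero, but stays inside the band
print(classify_effect(-30.0, -18.0, margin=10))  # entirely outside the band
print(classify_effect(-14.0, -6.0, margin=10))   # straddles the edge of the band
```

The first case is the one described above: the interval excludes zero, so there is a clear effect, yet it sits entirely within the range treated as not practically helpful.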

          • Noah Motion says:

            When the null hypothesis is that two groups of people/animals are exactly the same it IS false.

            The null hypothesis is (supposed to be) a statement about the population, not any given sample(s). So, e.g., if H0 is that mu_a = mu_b, this is a statistical statement of the substantive (i.e., non-statistical) claim that samples a and b are from the same population. The whole point of figuring out how a test statistic is distributed under the assumption that the null is true is that we don’t expect the means from our two samples to be exactly equal.

            • Suppose that your hypothesis is that Latino voters vote differently from non-Latino voters on issue A. Is there any point in even questioning the assumption that if you found ALL the Latino voters, and ALL the non-latino voters and polled ALL of them that the proportions would turn out to be different in at least say the 5th decimal place? It’s just too unlikely that groups of several million people would have exactly the same proportion of “yes” voters on issue A. In most cases it’s probably not even mathematically possible.

              Similarly for most any other issue. However, practically speaking, a difference in the 3rd or 4th decimal place is probably irrelevant to any substantive question.
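
The "not even mathematically possible" remark can be checked with a little arithmetic: two finite groups can share an exact proportion only at values that both group sizes can express. A small sketch, with made-up group sizes; the count `gcd(n_a, n_b) + 1` includes the trivial proportions 0 and 1.

```python
from math import gcd

def shared_proportions(n_a, n_b):
    """Count the proportions k_a/n_a that can also be written exactly as k_b/n_b."""
    return gcd(n_a, n_b) + 1

def shared_proportions_brute(n_a, n_b):
    """Same count by direct enumeration, for checking small cases."""
    return sum(1 for k_a in range(n_a + 1) if (k_a * n_b) % n_a == 0)

# Hypothetical group sizes, chosen only for illustration
print(shared_proportions(11_000_003, 25_000_000))
```

These two sizes share no common factor (25,000,000 has only 2 and 5 as prime factors, and 11,000,003 is divisible by neither), so the only exactly equal proportions are 0% and 100%.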

              • Noah Motion says:

                Maybe this is a case where NHST is inappropriate, but, if so, this points to a limitation of how general NHST is, not an inherent problem with its logic.

                In any case, it seems like maybe you could reason that, yes, these two groups likely differ to some degree in the population, but do they differ more than any other partition of the population into two groups?

            • question says:

              Noah,
              I never understood that sampling from a population description. If I look at the t-test I see that I am literally comparing my result to a “possible” result with 0 difference between sample means. There seems to be some sort of circularity going on by estimating the population parameters from the sample, then saying the null hypothesis is a statement about the population. I can’t put my finger on it. Besides that I find the following arguments against “checking if two groups are different” compelling:

              1) Most uses of NHST are not with random samples of a population, in fact it is usually not clear what population could be referred to. When people run experiments with rats they order from some company, what population are they sampling from? And yes there are people who will only run certain experiments with rats from “The top shelf of rack 4 from Room 3340C in the french facility”. We have no idea how common such behavior is.

              2) If you take any two groups of people/animals and compare them along enough dimensions you will find a real difference between them. This is not a false positive, they really will be different (“from different populations”). The only reason to not find a difference is lack of imagination and/or funding.

              3) When a treatment has been tested, the researchers have made sure the two groups differ. We know that they differ in at least one way. Just because they are different in some other does not mean the treatment works like they theorize it does.

              4) Also, other important factors could have differed between groups as well (at baseline or having evolved during the study). Usually we do not even know enough to be sure what factors to check. We do not understand the system well enough to be sure we are using all appropriate controls.

              • Noah Motion says:

                question, I, too, find those arguments pretty compelling. I think there are lots of cases that one or another null hypothesis significance test is inappropriate for. It seems pretty clear to me that lots of people blindly apply statistical tools without thinking carefully about the assumptions that the tools rely on.

                I think in your #1, the population probably *is* something like “The top shelf of rack 4 from Room 3340C in the french facility”, which likely has very severe implications for how general any findings with samples from that population are (i.e., the findings may well generalize to only that population and no other).

                For #2 and #3, I think the dimensions/characteristics/factors people intentionally ignore or don’t know about constitute (a large part of) the error in whatever statistical model is used. I don’t have a good sense of how often these neglected factors are well modeled by, e.g., normal error, and it doesn’t seem like people probe their models’ assumptions all that often (I know I’ve been guilty of this more often than I should be).

                With regard to #3, right, but this isn’t a problem with NHST per se, I don’t think. It’s a potential problem with whatever links there are between non-statistical hypotheses and statistical hypotheses. It seems to me that this problem applies to analyses that focus on estimation and modeling rather than testing, as well.

                For that matter, #1, #2, and #4 are issues for non-testing-based approaches, too. To the extent that we want to generalize beyond our sample, we are restricted to generalizing to whatever the appropriate population is. And anything we don’t explicitly model ends up (at risk of) being treated as error, whether or not that’s our intention.

      • “If you run twenty tests with alpha = 0.05 and the null hypothesis is true, you’re (reasonably) likely to get a false positive. It doesn’t follow from this that one of every twenty (published) tests with p < 0.05 is a false positive.”

        Isn't this what was meant (below)?

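        # 1000 one-sample t-tests on data for which the null (mean 0) is true
        pvals <- rep(NA, 1000)
        for (i in 1:1000) {
          x <- rnorm(1000)
          pvals[i] <- t.test(x)$p.value
        }
        mean(pvals < 0.05)  # share of tests significant at the 0.05 level
        ##[1] 0.046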

        • No, I think what was meant was that a given researcher might run say 200 experiments in their lifetime, let’s say they’re careful to choose good topics so say 10% of them have true null hypotheses (20) and of those 1 will show as significant at random and be published…. so the probability that a reasonably careful researcher (one who only looks at null effects 10% of the time) has at least 1 false-positive published null effect is pretty high.
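
          The arithmetic behind this can be written out in R. This is only a back-of-the-envelope sketch: it assumes the 20 true-null experiments are independent, that each is tested once at alpha = 0.05, and that every significant result gets published.

          ```r
          n_experiments <- 200    # experiments in a career (from the comment above)
          share_null    <- 0.10   # share of experiments where the null is true
          alpha         <- 0.05

          n_null         <- n_experiments * share_null   # 20 true nulls
          expected_fp    <- n_null * alpha               # expected number of false positives
          p_at_least_one <- 1 - (1 - alpha)^n_null       # chance of one or more false positives

          cat("expected false positives:", expected_fp, "\n")
          cat("P(at least one false positive):", round(p_at_least_one, 3), "\n")
          ```

          So the expected count is 1, and the chance of at least one false positive over the career is roughly 64%.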

          • Andrew says:

            Daniel:

            As I’ve written elsewhere, I think the problem is in the framing of hypotheses as “true” or “false.” Again take the notorious paper, “The Fluctuating Female Vote: Politics, Religion, and the Ovulatory Cycle.” What’s the hypothesis in that paper? Is it true or false? It’s hard to say. With a preregistered replication things are much clearer because the statistical test is decided ahead of time. But for un-preregistered papers, I think the whole idea of “true” or “false” falls apart.

            • I agree, I was more trying to clarify what I think jrc meant above.

              In other comments I’m trying to defend the idea that a Bayesian, looking at a research field dominated by NHST, knowing what we do about how the data (the p values) are produced… should probably conclude that it isn’t possible to rule out the idea that huge swaths of research have taught us nothing about the world, that unlike the p < 0.05 meaning maybe only 5% of research is wrong, we can easily find data and data-collection models that cause us to conclude that 40, 50, 70, even say 90% of published research in some fields is not informative about the topic at hand (more informative about biases in the field perhaps).

              • jrc says:

                Yeah, people always have to clarify what jrc was saying, because jrc often says technically sloppy things like “If you publish twenty papers with p-values of .05, one of them is likely to be a false-positive.”

                Introductory Anecdote: So a couple of weeks ago I’m standing in front of a dozen graduate students and I ask “What is a p-value?” That was met with complete silence.

                Mea Culpa: I was unclear and imprecise. Daniel and Noah both provided more articulate formulations. They are all helpful.

                Noah: Pr( false positive | p < 0.05) != Pr( false positive | H0 is true) – this is a concrete and true point. The false-positive with respect to a true null hypothesis rate is .05. And it is helpful. What I described wanting was the right side (though with p=0.05, not less than). So under this interpretation of what I said (which is what reasonably follows from the context) what I said was flat out wrong.

                Daniel: Daniel points out that, even given the fact that my implied calculation was wrong, the statement is probably true. That is, given the way we calculate p-values, and the nature of the variation in many of these experiments, and problems with multiple comparisons and un-acknowledged experiments, at least 1 out of 20 published p-values indicating "no effect" is probably wrong.

                Daniel 2: Daniel two then goes on to point out, in conjunction with our host, that the whole framework here is silly and the concept of "test" is unclear. And maybe they'd add that in many cases testing for "no effect" is meaningless, and even though a pre-specified test might give us something about which to judge the truth or falsity (the statement of the null hypothesis), it still doesn't mean the thing being considered true/false is necessarily interesting or that judging its truth explains very much.

                jrc reformulated: Suppose I publish a bunch of papers in my lifetime. I expect that in 5% of my papers the confidence interval I produce around my estimate will not cover the true parameter*. So in that sense, if I actually believe in the statistical theory, I expect to be "wrong" about 5% of the time, in that the true mean lies outside of my 95% confidence interval with that frequency.

                new jrc unbeholden to the old one: What really got me was the idea that a false-positive is somehow equated with fraud. In any world where statistics and probability are used to make conclusions about reality, we expect our methods to miss the truth some non-negligible proportion of the time, even in the simplest settings with randomized variation. We can do better and worse at improving our procedures for statistical inference (including being very interested in the magnitude and precision of our estimates and thinking carefully about what our various statistical tests are telling us or not telling us) but the whole nature of the game involves uncertainty.

                *if I use standard error estimates that produce reliable confidence intervals… but that sounds like begging another set of questions.

                **thanks Andrew, your harping on these points about variability and uncertainty has pushed me to think about it in these terms, and I've found the exercise helpful, empirical evidence above regarding imprecision of my thinking aside.
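
                The distinction drawn in the reply to Noah above, between the false-positive rate given a true null and the share of significant results that are false positives, can be sketched with Bayes’ rule in R. The function name and the input values below are hypothetical; the only fixed quantity is alpha = 0.05.

                ```r
                # Share of significant results that come from true nulls, given
                #   prior_null: assumed fraction of tested hypotheses where the null is true
                #   power:      assumed probability of p < alpha when the null is false
                false_discovery_share <- function(prior_null, power, alpha = 0.05) {
                  (alpha * prior_null) / (alpha * prior_null + power * (1 - prior_null))
                }

                # Two hypothetical research fields:
                well_powered <- false_discovery_share(prior_null = 0.5, power = 0.8)
                long_shots   <- false_discovery_share(prior_null = 0.9, power = 0.2)

                cat("well-powered field:", round(well_powered, 3), "\n")
                cat("long-shot field:   ", round(long_shots, 3), "\n")
                ```

                In both cases Pr(p < 0.05 | H0 is true) is 0.05, yet the share of significant results that are false positives runs from about 6% to about 69% depending on power and on how often the null is true.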

          • Hi Daniel,

            so I did the simulation that is implied by your post above. I simulated 1000 scientists, each doing 200 experiments over their lifetime, with a 20% chance of sampling from a distribution where \mu=0. Say the true difference when the null is false is 2 and SD 50 (unrealistic since each expt will be something different, but OK here because we keep power steady at one low value, which is the situation in most publications).

            The min and max range of proportion of false positives I get under these conditions is 0.01 to 0.19, mean is 0.06.

            If I run higher power studies (true mean is 5 in expts where the null is false, power about 60%), the min and max range of proportion of false positives I get under these conditions is 0.006 to 0.05, mean is 0.02.

            What I take away from this is that I need to be running high power studies even if I am trapped in a world governed by p-values. That’s the most I can do to not waste everyone’s time and money (although what I *have* done in my research is abandon p-values, after Stan came onto the scene).

            Here are some quick power calculations:
            power.t.test(n=1000,delta=2,sd=50) ## power 14%
            power.t.test(n=1000,delta=5,sd=50) ## power 60%
            power.t.test(n=1000,delta=10,sd=50) ## power 99%

            My code is here:

            https://gist.github.com/vasishth/4517aa9c88ec6eb61e97
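
            For readers who don’t want to open the gist, here is a rough, independent sketch of this kind of simulation in R. It is not the code from the gist. To keep it fast, it approximates each two-sample t-test (n = 1000 per group, sd = 50) by a z-test, and it counts false positives as a share of each scientist’s significant results; both choices are assumptions, so the numbers need not match those reported above.

            ```r
            set.seed(123)
            n_scientists  <- 1000
            n_experiments <- 200
            p_null        <- 0.20   # chance that the true effect is exactly 0
            alpha         <- 0.05

            # Assumed approximation: the test statistic is Normal(ncp, 1).
            ncp_for <- function(delta, n = 1000, sd = 50) delta / (sd * sqrt(2 / n))

            simulate_lifetimes <- function(delta) {
              total   <- n_scientists * n_experiments
              is_null <- matrix(runif(total) < p_null, nrow = n_scientists)
              z <- matrix(rnorm(total, mean = ifelse(is_null, 0, ncp_for(delta))),
                          nrow = n_scientists)
              significant <- abs(z) > qnorm(1 - alpha / 2)
              false_pos   <- significant & is_null
              list(
                # per-scientist share of significant results that are false positives
                fp_share_of_significant = rowSums(false_pos) / pmax(rowSums(significant), 1),
                # proportion of scientists with at least one false positive
                any_false_positive = mean(rowSums(false_pos) > 0)
              )
            }

            low  <- simulate_lifetimes(delta = 2)   # low power
            high <- simulate_lifetimes(delta = 5)   # higher power

            cat("mean false-positive share, low power: ", mean(low$fp_share_of_significant), "\n")
            cat("mean false-positive share, high power:", mean(high$fp_share_of_significant), "\n")
            cat("scientists with >= 1 false positive:  ", low$any_false_positive, "\n")
            ```

            Each row of the matrices is one scientist’s career, so the per-scientist summaries are just row sums.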

            • Sure, and I’m glad to hear that you’ve abandoned p values in favor of more modeling (which is really what Stan buys you). jrc’s point will be borne out in your simulations though, what is the frequency with which a given researcher produces at least 1 false positive in his or her lifetime even under the high-powered version? It’s probably a very large percentage (of researchers with at least 1 false positive), since mean false positive rate under high powered conditions was 0.02 and there are 200 experiments each researcher, we expect about 4 false positives per researcher, the number of researchers with none is going to be certainly less than 50% maybe more like 10% or something.

              but, and this is critical to jrc’s point, none of those researchers were doing anything *wrong*, *fraudulent*, or *evil* and that is what Schnall was equating lack of replicability with.

              • Yes, under the conditions I specified, the simulation shows that 96.6% of the 1000 simulated scientists are going to produce at least 1 false positive in a lifetime of 200 experiments. And this number doesn’t change even if I always run high power studies.

                Revised code is here for the current test of what proportion of scientists got at least one false positive:

                https://gist.github.com/vasishth/14192ae70a4ab4d7c56a

                I wonder what would happen if I add at least one replication of each experiment (to get three p-values in the two runs of an experiment) to count as a significant effect? I will check that and get back.

              • I continued with my simulation as follows: I set it up so that the simulated scientist cannot declare significance unless he/she can demonstrate the significant effect in one further replication (so he/she has to get p<0.05 twice in a row to publish, in two independent studies).

                Now, for 1000 scientists running 200 experiments each (well, now they have to run many more in reality), the proportion of scientists declaring false significance more than once is 0.001, as opposed to nearly 1 earlier. This is with low power studies as defined in my simulation (about 16% power).

                So, the best thing people using p-values can do is re-run their experiment as soon as they get a significant result, and publish only if both results are significant (and in the same direction).

              • Just one more comment: I wrote:

                “So, the best thing people using p-values can do is re-run their experiment as soon as they get a significant result, and publish only if both results are significant (and in the same direction).”

                That’s not quite right. At least in my small simulated world, the researcher has to run high power studies AND make sure they get at least one replication done before they publish a significant result. With low power studies they’d get almost nothing significant over their lifetime (hence the absence of false positives in my last simulation, which requires a replication), and would be able to publish almost nothing.
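
                The effect of the two-in-a-row rule can be sketched with simple arithmetic in R. This assumes the two runs are independent and ignores the same-direction requirement (which would halve the figure for the null case); the power values of 14% and 60% are the ones from the power.t.test comments earlier in the thread.

                ```r
                alpha <- 0.05

                # Probability of two significant results in a row from independent runs
                two_in_a_row <- function(p_significant) p_significant^2

                false_claim_per_null  <- two_in_a_row(alpha)  # null is true
                true_claim_low_power  <- two_in_a_row(0.14)   # real effect, power 14%
                true_claim_high_power <- two_in_a_row(0.60)   # real effect, power 60%

                cat("null true:          ", false_claim_per_null, "\n")
                cat("real effect, 14%:   ", true_claim_low_power, "\n")
                cat("real effect, 60%:   ", true_claim_high_power, "\n")
                ```

                The rule cuts the false-claim rate per null experiment from 0.05 to 0.0025, but with 14% power a real effect also clears it only about 2% of the time, compared with 36% at 60% power.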

              • jrc says:

                This blog is so much fun. Thanks Daniel for pushing the comments, and thanks Shravan for the simulations. That whole “if they do low power studies and require two significant results in a row to publish, nothing will ever publish” bit is really entertaining. I think the next time I re-do my power calculation lecture I’ll add in a set of simulated scientists doing various kinds of studies with various “publishing requirements”. Thanks for the idea.

              • jrc, you may also find this simulation interesting and useful in your classes:

                http://xianblog.wordpress.com/2014/05/09/stopping-rule-impact/

                Many psycholinguists (and presumably psychologists?) generally don’t even hide the fact that they run experiments until they hit significance.
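
                A minimal R sketch of what running until significance does to the false-positive rate. The stopping rule here (test after every 10 observations, give up at n = 100) is a hypothetical one chosen for illustration, not taken from the linked post, and the null is true in every simulated experiment.

                ```r
                set.seed(42)

                # Returns TRUE if the experiment ever reaches p < alpha before max_n.
                run_until_significant <- function(batch = 10, max_n = 100, alpha = 0.05) {
                  x <- numeric(0)
                  while (length(x) < max_n) {
                    x <- c(x, rnorm(batch))                 # collect another batch; true mean is 0
                    if (t.test(x)$p.value < alpha) return(TRUE)
                  }
                  FALSE
                }

                n_sims <- 2000
                optional_stopping_rate <- mean(replicate(n_sims, run_until_significant()))
                fixed_n_rate <- mean(replicate(n_sims, t.test(rnorm(100))$p.value < 0.05))

                cat("false-positive rate, stop when significant:", optional_stopping_rate, "\n")
                cat("false-positive rate, fixed n = 100:        ", fixed_n_rate, "\n")
                ```

                The fixed-n rate should sit near the nominal 5%, while the run-until-significant rate should come out well above it, because each extra look at the data is another chance to cross the threshold.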

  11. Chris Crandall says:

    This post may miss Schnall’s point that the contentious behavior that is accepted for political science *isn’t* what Schnall is complaining about. She’s complaining about all the work she has to do to service a cadre of researchers who want her cooperation for materials and procedures. She doubts their good will, she doubts they’ll work hard to publish “successful” replications, and she doubts her efforts are best spent taking care of their needs.

    • Andrew says:

      Chris:

      I don’t see why Schnall needs to service a cadre. She published a paper, presumably the paper describes its methods clearly enough that the study can be replicated, and anyone who wants can re-run the experiment. And people can interpret the new results however they want, just as they can interpret the original results. The original paper gets extra credit for originality, but a pre-registered replication has its own virtues. It’s good to have both.

      To put it another way, what if a researcher retires or moves to Australia or whatever. Someone should still be able to replicate his or her findings. The Stroop effect didn’t disappear just because Stroop died.

      • In biology to replicate things would take a lot of “hands on” knowledge that isn’t typically in the paper, as well as a host of materials that may no longer be available (such as particular genotypes of mice, particular molecular biology constructs, particular stains, particular machinery, etc).

        Typically what’s published in biology papers is enough to determine whether the protocol makes sense at a high level, what it’s testing, whether proper controls were run, etc, but not enough to actually re-perform the experiments.

        I imagine that in many areas of psychology it’s possible that a fair amount of “hand holding” would be required to replicate an experiment as well.

      • Chris Crandall says:

        I wish it were true. But just as a sophisticated recipe from a cookbook by a fine French baker does not guarantee success, so too do replicators benefit from individual advice and description of tacit knowledge. If we had to put *everything* into the journal write-up, the article would be much, much longer.

        • Andrew says:

          Chris:

          You may be right, and this is another good reason to have pre-registered replications, so that we can learn that a certain effect, believed by many to be generally true and an important aspect of human nature, only seems to hold under some very special conditions. That’s good to know, as science studies are commonly presented, both by their authors and by the news media, as representing quite general phenomena.

          • Martha says:

            Also, now it’s easy to put supplementary information online, so it doesn’t all have to be in the published article — just give the reference there to the web page. So no excuse not to publish (online) what is needed to replicate.

            Also re Daniel’s comment: My experience in reading (lots of) biology papers is that typically the details on the statistical analysis are not even enough to make sense at a high level what’s been done, let alone “enough to actually re-perform” the statistical analysis. And also usually omitted is discussion of *why* the particular statistical analysis was chosen/appropriate (e.g., why there is any reason to believe that the model assumptions are anywhere near valid. One example that I recall: using a method assuming bivariate normality when the graph provided of the two variables sure didn’t look bivariate normal.)

            • Martha, I agree with you entirely when it comes to statistics in bio journals, I was thinking more of the biological experimental methods themselves. This is typically what biologists care enough about to describe in moderate detail. In my experience, if a paper is written by biologists, and it has statistics in it, you can be virtually sure that the biologists do not have any idea what the statistics mean. Typically they were either told to perform the particular analysis using an off-the-shelf software, or maybe if they’re lucky they got some help from a bio-stats person, and that person probably chose something that the biologists wouldn’t feel too uncomfortable with, something that is seen in other papers in the field, often having absolutely no theoretical justification.

      • John C says:

        “… what if a researcher […] moves to Australia…”

        Clearly a strategy for immunity from replication persecution. Just don’t ask for political asylum.

    • joe says:

      I’m just a grad student (math psych / JDM), and don’t rate to join this conversation, but this makes no sense to me. Publish a paper with clear methods, include your data and include your code as appendices/supplements, and then what burden is there to take care of anyone’s needs? That someone could publish a paper for which the analyses are not immediately reproducible seems to me to be the crazy position.

  12. Rahul says:

    One of the underlying points in this debate is that publishing a failed replication of some high profile work has opened up short paths to glory for a lot of researchers, many of questionable quality.

    This has absolutely nothing to do with the scientific merits & everything to do with careers, media, funding, publicity & the flawed way in which the academic system assigns credit.

  13. In academic economics and finance the carrots and sticks are incredibly powerful. Don’t publish in the journals they consider prestigious and end up at Southwest North Dakota State – or as an adjunct making a Walmart wage, with no health insurance for your family. Or play their game well, and be a professor at a top state university, or Ivy, making six figures, or high six figures, with lots of prestige and perks.

    So much wrong and ugly about the system and what it rewards and penalizes. Of course, humanity imperfectly lurches forward; look where we were 300 years ago, or 3,000, or even 50, in the all important science, medicine, and technology. Anyway, two things:

    1) There’s no reward for duplication of studies, and mammoth penalties – you’ll spend huge time to do it well and get zero in return. Meanwhile, they’ll ask why you haven’t published anything in all that time you spent on duplication, and kick you to the curb.

    2) There’s little reward for explaining and interpreting existing models to the real world and policy, even though that’s so crucial to society. If it’s not a new model or new empirical analysis, no pub, little credit. As a result, we often get frightening interpretation of existing models and data to reality and policy, that’s so harmful and sub-optimal. I recently spent a ridiculous percentage of my five minutes per week of free time understanding and intelligently interpreting Wallace Neutrality (Wallace AER 81), because of the amazingly stupid and harmful literal ways I saw it interpreted to reality by top professors with influence. It’s a lot of work to really understand a model intuitively and interpret it well to reality and policy, but it’s unlikely to be a pub. I got some blog mentions, like by Miles Kimball:

    http://blog.supplysideliberal.com/post/97788494421/richard-serlin-in-theory-but-not-in-practice-the

    But unless I get a new model out of it (maybe), there will be very little reward.

    • Andrew says:

      Richard:

      I agree with much of what you wrote but I don’t think it’s completely true. In your link you refer to some discussions with Brad Delong and Noah Smith, both of whom are prominent within the economics profession and more generally largely because of their efforts in explication, that is, from their blogging.

      So there is some reward for explaining and interpreting existing models, as that’s a lot of what Delong and Smith do. But I agree on your point 1, that there’s virtually no reward for duplication or replication. And, regarding point 2, there don’t seem to be a lot of niches available for explicators/bloggers, even in econ.

      Or, to put it another way, if you want to be an explicator, you have to do it in your spare time or, if you’re paid to do it, it probably won’t be by an academic or quasi-academic institution. (Three possible exceptions I can think of are Dean Baker, Tyler Cowen, and Alex Tabarrok, but all of them work for advocacy organizations of one sort or another, which is a bit different than the traditional academic position.)

      • Little time, but the rewards for really understanding the models and empirical studies intuitively and well, and being able to interpret them intelligently to reality and policy, are very very small, outside of what it can do to increase pubs. And as a result, we get prestigious economists often giving horrible advice (see Paul Krugman’s constant complaints), and saying very stupid things. Worse, so few economists in their area have good intuitive understanding with regard to the real world implications, so they can’t catch them and call them out.

        It’s almost all new pubs in reward, punishment, and prestige, and almost not at all how well do you understand the models and statistics implications for reality and policy; in other words how well do you really understand economics as far as applying it to our real world concerns.

        The exceptions are tiny. Noah is very rare, and probably will still be judged 80-95% on his pubs when the tenure decision comes; I hope he’s ready. Brad’s name comes overwhelmingly from his Berkeley position, which came almost completely from pubs.

        Likewise, you can talk until you’re blue in the face about duplication, but almost no one will do it with the current no reward, and mortal danger from spending time at it. I’d add a huge issue is pretty much no one, referee included, checks crucial things like the computer code, how well the optimization was done, global or local, what specifics, how was the data gathered, and by who – an exhausted grad student who barely speaks English and has no understanding of the financial documents she’s evaluating, and so on.

        I wonder what the reward incentive is in the hard sciences, where duplication is much more common, although things can be much more transparent and easy to duplicate there.

  14. EJ Wagenmakers says:

    Luckily, social psychologists are generally not “fine French bakers” to the same extent that biologists are. It is evident (to me, at least) that most if not all of the experiments that fail to replicate are straightforward to conduct for anyone with a PhD in psychology. Any recourse to a “secret sauce” is problematic on its own, but I’ll leave that be. The experiments at hand involve simple manipulations and simple dependent measures, and most of the time these are clearly described in the article. The fine-French-baker argument for social psychology (or experimental psychology, for that matter) — no, I don’t buy it.
    E.J.

    • Thom says:

      Neither do I. I have sympathy for certain methods where set-up of equipment etc. can matter hugely. My worry is that the “fine French baker” analogy reflects a fundamental misunderstanding. I think they are referring to fine-grain differences in materials and instructions (and so forth), but these are the very factors that lead to heterogeneity of effects and that Andrew and others are trying to highlight.

    • Chris Crandall says:

      EJW: If you assert that “experiments at hand involve simple manipulations and simple dependent measures,” then you are clearly arguing that the replications are cherry-picked to be easy to do. In such a context, then it is true that skill is not important. To generalize to all of social psychology is to be inaccurate and lazy. When you pick easy-to-replicate studies, then the results generalize only to the category of easy-to-replicate. To say otherwise smacks of bias.

  15. I find ERP research very tricky; veterans describe it as an art form. Eyetracking is also very tricky; you have to make a lot of decisions that don’t make it into the paper. We did a bunch of co-registration studies (eyetracking and ERP combined) over the last few years, and the combination of these methods leads to a multiplicative increase in difficulty in communicating in the paper every decision that was made.

  16. Steve Sailer says:

    One big problem here is that social psychologists see the publications of failed replications of their work as tantamount to accusations of incompetence or fraud. But if we all try harder to remember that the social sciences aren’t physics and astronomy, that societies are changing all the time, well, that provides a kinder, gentler excuse for a study failing to replicate: Things Change.

    For example, say you can’t replicate the 1990s study about priming students to walk slower down the hallway by showing them words about being old. That failure doesn’t automatically imply the author of the famous study was some despicable fraud or foul-up. After all, it was easy to prime students to do lots of things in the 1990s that are hard to prime them to do today: wear flannel shirts or dance the Macarena, say.

    We don’t know all the factors that went into making students in the 1990s primable to do things that were in fashion then, so we can’t reproduce the cultural atmosphere around the original experiment.

    And primers get worn out and stop working naturally. For example, if in 1910 you said the number “23” to a bunch of college students, many of them would instantly respond “Skidoo!” Back then, that “23” was a strong primer. Now, it’s not. Things change.

    Now I’m not the world’s leading expert on how to be diplomatic, but this perspective might make social psychologists feel less like they are being personally libeled by the publication of failed repetitions.

  17. Very interesting post. From my own perspective of having done research in chemistry and physics, I found this comment troubling:

    “In her post, Schnall writes, ‘it is not about determining whether an effect is “real” and exists for all eternity; the evaluation instead answers a [simple] question: Does a conclusion follow from the evidence in a specific paper?’—so maybe we’re in agreement here.”

    This really just tries to sweep the issue under the rug, since it’s true except when the paper claims, or is written to appear to claim, that the effect is “‘real’ and exists for all eternity.” I can’t think of too many papers that are written with the sort of caveat to the effect that “we noticed this result; it was statistically significant; but it could be gone tomorrow.” You won’t find that statement in the physical sciences, and in what I read now in the literature about education and pedagogy, the results are usually cast as being as reliable (i.e., eternal) as those in the physical sciences.

    And many researchers in the fields have no problem selling programs and technologies based on their research to schools, departments, and legislatures, who are led to believe that the underlying science reflects a basic truth about nature. I doubt sales would be so robust if the researchers presented their work as simply, “well, the conclusion follows from the evidence presented; but don’t count on it.”

    What’s really troubling here is the attitude that a field can be considered a science without having to adhere to the standards of scientific research. Good science requires testing of earlier results; otherwise, we have no direct evidence of what statements about nature are reliable. Trying to disprove an earlier result is no sin: Confirmation is not the best way to learn; we learn through failure. Simply claiming that a consensus of peers is tantamount to scientific truth is nothing more than world-building.

    • Steve Sailer says:

      “I can’t think of too many papers that are written with the sort of caveat to the effect that “we noticed this result; it was statistically significant; but it could be gone tomorrow.””

      The funny thing is that a lot of psychology researchers are trying to do work that will be interesting to marketers. Malcolm Gladwell gets paid for taking social science research and putting it into an interesting speech to give at a corporate conference. A lot of social scientists would like to get in on some of that action themselves. And quite reasonably.

      Thus “priming” is a hot topic because it is Science but it sounds applicable to marketing.

      Marketers prime consumers all the time. They’re pretty good at it. Jack-in-the-Box commercials, for example, really do prime me to stop at Jack’s on the way home.

      The obvious problem, though, is that Marketing Wears Off. Jack in the Box has been perfectly targeting my sense of humor and my emotional affiliations with their Funny Corporate Boss commercials for 15 years or so. But eventually either guys like me are going to get bored by them or we’re all going to drop dead from eating their products, and they’ll need a different campaign.

      Take a look at the history of great marketing campaigns like: “Ivory Soap: 99 and 44/100ths Pure!” That was a great slogan in the 19th Century.

      Today? Well, are you saying Ivory Soap is 0.56% impurities? Yuck.

      Marketers find the endless search for something that will work better than what they’ve got now exhausting. They want scientists to come tell them the eternal truths revealed by Science.

      But of course there is no eternal competitive advantage in marketing. It’s just an endless hamster wheel of fashion.

      But that realization at least offers a non-condemnatory way to announce that a quasi-marketing study, like a priming experiment, failed to replicate: “It doesn’t seem to work anymore.” You’re not saying the original study was a fraud or incompetent, just that things are presumably different now.

    • Steve Sailer says:

      And here’s a dignified way to characterize political science findings that eventually stop replicating: they’re now part of History.

      For example, Dr. Gelman says his Red State Blue State findings didn’t hold up all that well in 2012 based on a sample of 5000 voters. Personally, I wouldn’t be surprised if they held up better in the huge Reuters-Ipsos sample of 40,000 voters. The theory makes an awful lot of sense to me.

      But even if the future evolves in such an unlikely way that Red State Blue State never happens again, that doesn’t mean it was wrong in the past. It’s still an explanation highly relevant to historians trying to understand politics over the last generation.

      It’s kind of like how the Wishbone offense doesn’t work very well anymore in college football, but the coaches who developed it deserve their major place in the history of 20th Century college football.

    • Thom says:

      I think you perhaps misunderstand. It could be very interesting that an effect varies between individuals and between contexts – there are whole areas of psychology and other disciplines (notably biology) that study this variability. The problem isn’t that the variability exists (there are good ways for sciences to deal with this kind of variability) but that researchers present evidence of an effect in a small number of contexts as evidence of a large or reliable effect in general.

      • Actually, I think your comment is consistent with mine. My issue is the frequent mischaracterizing of results that are limited in scope and reliability as describing large populations or being reliable in a context much larger than reasonably supported.

        I do agree that the variability of effects can be very interesting, as you point out. But IMHO, we continue to fool ourselves into making important public decisions based on results of limited reliability and scope.

  18. Noah Motion says:

    “We look at the same research question from different angles, using different methods. This is called a conceptual replication…. Although [direct replication/method repetition] can help establish whether a method is reliable…”

    Is this a tacit admission that no one bothers to perform even cursory checks on the reliability of (social) psychology measurement? Shouldn’t this bother (social) psychologists?

  19. Thom says:

    Maybe – this is at least in part a lack of willingness to fund the right studies and a lack of willingness to look at the available evidence.

    It is also worth noting that many researchers are much more cautious in their appraisal of the evidence – it is a shame that being a cautious, careful scientist won’t get you in the top journals as easily and won’t get you funded as easily. Those pressures exist in all fields (even in physics) – but some subfields seem more prone to them (perhaps because they attract media and funder interest more readily).

    I’m not sure any of the cases we’ve been discussing had a major impact on policy. For example, educational policy changes often involve big intervention studies looking at many schools. Of course lots of policy gets decided with no evidence basis (but that’s because the evidence is being collected or selected after the political decision is made).

  20. Fafa says:

    Interestingly, her direct quote from Rosenbaum on the reliability of the Stroop task is off by a bit. The actual quote is the “Stroop effect is ONE OF the most reliable and robust phenomenon…” Schnall omitted the “ONE OF” phrase. Oops.

  21. Noah Motion says:

    Not sure why my replies in the thread above aren’t showing up, so I’m posting them here at the bottom of the discussion. I hope my interlocutors see this.

    In response to Daniel Lakeland’s comment:

    Maybe this is a case where NHST is inappropriate, but, if so, this points to a limitation of how general NHST is, not an inherent problem with its logic.

    In any case, it seems like maybe you could reason that, yes, these two groups likely differ to some degree in the population, but do they differ more than any other partition of the population into two groups?

    In response to question’s comment:

    question, I, too, find those arguments pretty compelling. I think there are lots of cases that one or another null hypothesis significance test is inappropriate for. It seems pretty clear to me that lots of people blindly apply statistical tools without thinking carefully about the assumptions that the tools rely on.

    I think in your #1, the population probably *is* something like “The top shelf of rack 4 from Room 3340C in the French facility”, which likely has very severe implications for how general any findings with samples from that population are (i.e., the findings may well generalize to only that population and no other).

    For #2 and #3, I think the dimensions/characteristics/factors people intentionally ignore or don’t know about constitute (a large part of) the error in whatever statistical model is used. I don’t have a good sense of how often these neglected factors are well modeled by, e.g., normal error, and it doesn’t seem like people probe their models’ assumptions all that often (I know I’ve been guilty of this more often than I should be).

    With regard to #3, right, but this isn’t a problem with NHST per se, I don’t think. It’s a potential problem with whatever links there are between non-statistical hypotheses and statistical hypotheses. It seems to me that this problem applies to analyses that focus on estimation and modeling rather than testing, as well.

    For that matter, #1, #2, and #4 are issues for non-testing-based approaches, too. To the extent that we want to generalize beyond our sample, we are restricted to generalizing to whatever the appropriate population is. And anything we don’t explicitly model ends up (at risk of) being treated as error, whether or not that’s our intention.

