“Priming Effects Replicate Just Fine, Thanks”

I came across this 2012 post by John Bargh who does not seem to be happy about the failures of direct replications of his much-cited elderly-words-and-slow-walking study.

What strikes me about Bargh’s comments is how they illustrate the moving-target approach to much of science.

Here’s the quick story. In 1996, Bargh, Chen, and Burrows published a paper with the striking finding that students walked more slowly when they were primed with elderly-related words such as bingo and Florida. The result was statistically significant at the 5% level.

The Bargh et al. paper has been influential and has been cited hundreds of times. But recent attempted replications of the effect have failed, which leads many outsiders (including me) to suspect that the original finding was a classic garden-of-forking-paths power=.06 story of an opportunistic data analysis.

But here’s what Bargh wrote:

There are already at least two successful replications of that particular study by other, independent labs, published in a mainstream social psychology journal. . . . Both appeared in the Journal of Personality and Social Psychology, the top and most rigorously reviewed journal in the field. [JPSP also published Bem’s notorious ESP paper — ed.] Both articles found the effect but with moderation by a second factor: Hull et al. 2002 showed the effect mainly for individuals high in self consciousness, and Cesario et al. 2006 showed the effect mainly for individuals who like (versus dislike) the elderly.

Hull, J., Slone, L., Metayer, K., & Matthews, A. (2002). The nonconsciousness of self-consciousness. Journal of Personality and Social Psychology, 83, 406-424.

Cesario, J., Plaks, J., & Higgins, E. T. (2006). Automatic social behavior as motivated preparation to interact. Journal of Personality and Social Psychology, 90, 893-910.

Moreover, at least two television science programs have successfully replicated the elderly-walking-slow effect as well, (South) Korean national television, and Great Britain’s BBC1. The BBC field study is available on YouTube.

OK, I think we can just pass by the replication-by-TV-show argument in polite silence.

More interesting is the case of the so-called replications by Hull et al. and Cesario et al., which follow the now-familiar pattern of whack-a-mole or chase-the-grain-of-rice-around-the-plate.

A study is performed, a statistically significant correlation is found, and the results are published. Then in an attempted replication, the effect no longer appears—but there is a statistically significant interaction. Then another attempted replication, another interaction.

From Bargh’s point of view, this must look like science at its best: each new study brings new insight. A mere replication would be boring—maybe useful in quieting the skeptics, but that’s about it. But a new interaction (a “moderator”): that’s exciting, new stuff. Two new studies, two new interactions.

From my perspective, though, this is all consistent with noise mining, with statistical significance arising from zero (or, more precisely, highly variable) effects plus chance variation. It’s the garden of forking paths: with so many potential interactions, there are so many ways to win, to get “p less than .05.”
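
To see what that arithmetic can do, here is a small simulation (a hypothetical sketch, not modeled on any of the actual studies: the sample size, the number of candidate moderators, and the correlation-style test are all made-up choices) of how often a pure-noise experiment yields at least one “significant” main effect or interaction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n, n_moderators = 10_000, 30, 5   # assumed study size and number of candidate moderators
hits = 0

for _ in range(n_sims):
    prime = rng.integers(0, 2, n)               # treatment indicator; there is no true effect
    y = rng.normal(size=n)                      # outcome is pure noise
    mods = rng.normal(size=(n, n_moderators))   # candidate moderators (self-consciousness, attitudes, ...)

    # "Forking paths": test the main effect, then each prime-by-moderator interaction,
    # using a simple correlation test as a stand-in for whatever analysis gets reported.
    for x in [prime] + [prime * mods[:, j] for j in range(n_moderators)]:
        r = np.corrcoef(x, y)[0, 1]
        t = r * np.sqrt((n - 2) / (1 - r**2))
        if abs(t) > 2.05:                       # roughly the two-sided .05 cutoff for df = 28
            hits += 1
            break

print(f"At least one 'significant' finding in {hits / n_sims:.0%} of pure-noise studies")
# With six chances per study this lands near 1 - 0.95**6, i.e. roughly 26%, not 5%.
```

And this sketch is conservative: it fixes the six comparisons in advance, whereas in practice the choices of outcome, exclusions, and subgroups are also made after seeing the data.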

Does this mean I think interactions should be set aside? No, not at all. I’ve been on record for years as saying that interactions are important. Much of my own most successful applied work has involved interactions.

But . . . how seriously does Bargh himself take interactions? He mentions there papers: his original article with no interactions, the second paper with interactions with self consciousness, and the third paper with interactions with attitudes toward the elderly.

That’s all fine, but in that case, why not look at all of these interactions in all of the studies? What looks suspicious to me is that the interactions are only looked at when they are statistically significant. But, again, seeing occasional statistically significant interactions is exactly what we would expect, from chance alone, if nothing were going on.

Bargh concludes:

Research has now moved on from the demonstration and replication of priming effects on social judgment and behavior to research on the mechanisms underlying the effects and the moderators, constraints, and limitations of those effects.

Ummmm, no. Bargh’s research may have moved on, and that’s fine; it’s good to move on and study new things. But for many of the rest of us, no, these effects have not been demonstrated, and the failed replications make the whole thing look like the sort of mess that Paul Meehl wrote about, decades ago.

And, no, I don’t think replications on YouTube count for much.

108 thoughts on ““Priming Effects Replicate Just Fine, Thanks””

  1. Andrew writes:

    “But . . . how seriously does Bargh himself take interactions? He mentions there papers”

    Does he mean

    He mentions their papers
    or
    He mentions three papers

  2. Whack-a-mole indeed.

    Saw a clip recently on some news show (can’t find it now of course). Claim was that they could show a horse a picture of someone with some expression on his face (angry, happy, …) and the horse could distinguish them. Experimenter was in stall with horse holding picture in front of horse’s face but horse could see both picture and experimenter. No control over what the experimenter did (at least, none obvious).

    Clever Hans, anyone? How old is that one, at least a century I believe.

    Yep. Debunked in 1907. https://en.wikipedia.org/wiki/Clever_Hans

  3. “A study is performed, a statistically significant correlation is found, and the results are published. Then in an attempted replication, the effect no longer appears—but there is a statistically significant interaction. Then another attempted replication, another interaction.

    From Bargh’s point of view, this must look like science at its best: each new study brings new insight. A mere replication would be boring—maybe useful in quieting the skeptics, but that’s about it. But a new interaction (a “moderator”): that’s exciting, new stuff. Two new studies, two new interactions.”

    I fear this will become more and more prevalent in the years to come. As replications of famous findings fail, the researchers in the field will perform new under-powered, p-hacked studies like they are used to and will no doubt find all kinds of moderators and who knows what. Then a new rigorous replication will focus on these moderators and will find nothing again. Etc. etc.

    For a glimpse into the future, it might be interesting to note the case of the “unconscious thought theory”:

    1) Critique and failed replications.

    2) Then a meta-analysis by Strick et al. (Strick, M., Dijksterhuis, A., Bos, M. W., Sjoerdsma, A., & Van Baaren, R. B. (2011). A meta-analysis on unconscious thought effects. Social Cognition, 29, 738-762.) showing all kinds of moderators.

    3) Then a rigorous replication study which took these moderators into account (Nieuwenstein, M. R., Wierenga, H. T. C., Morey, R. D., Wicherts, J. M., Blom, T. N., Wagenmakers, E. J., & Van Rijn, H. (2015). On making the right choice: A meta-analysis and large-scale replication attempt of the unconscious thought advantage. Judgment and Decision Making, 10, 1-17.) which showed no effect again.

    4) Then this recently published review article coming up with another set of moderators/things to take into account (Dijksterhuis, A., & Strick, M. A case for thinking without consciousness. Perspectives on Psychological Science, 11, 117-132.)

    It’s a never-ending cycle. Researchers can stretch this process out over decades. Which is good for the original authors: lots of citations! I however doubt any real progress has been or will be made. To me it’s all a giant waste of time, money, and energy.

    It all makes me think about how to stop this from happening, and all I can come up with is that it might be useful to set a higher standard for publication in the first place. I really like the idea of Registered Reports (https://osf.io/8mpji/wiki/home/) as a way to stop false positive findings from entering the literature in the first place.

    • W:

      I do think this stuff will go away soon, or at least be less prominent. It’s my impression that most psychology researchers dislike all this gimmicky research on embodied cognition, himmicanes, ovulation and voting, power pose, etc etc. They don’t like it on substantive grounds and they’re embarrassed that this is the psychology research that’s getting publicity. So I think it will become harder to publish this sort of thing in Psych Science, JPSP, etc, it will be harder for people who do this sort of work to get good academic jobs, etc. It’s a fad that’s ending. I’m sure it will still sell lots of business books, but the high tide has receded, and pretty soon the Barghs of the world will pretty much just be talking to each other, with nobody listening to them but Malcolm Gladwell, the organizers of Ted talks, and NPR producers. And then eventually Gladwell, Ted, and NPR will get the clue too, and this sort of work will be just a time capsule, the 2000-era equivalent of 1970s science fads such as spoon-bending, Chariots of the Gods, and the Bermuda Triangle.

      • Or maybe a better analogy, given the academic context, would be the deconstructionists. They used to be considered a sort of menace but now they’re just a punch line. They talk with each other but the rest of us can just ignore them.

      • Okay, but Malcolm Gladwell found a huge audience for Bargh’s priming study less among academics than among marketers.

        People in the marketing business are constantly trying to prime consumers to do things. And to some extent they can, but it’s hard work, in part because what worked in the past doesn’t always work today. For example, Bill Cosby’s Jello pudding pop ads moved a lot of product 35 years ago, but probably wouldn’t today. Marketers apparently like being told that there are underlying scientific principles because that promises them an escape from the constant grind of trying to gin up new fads. They want to learn how to make their manipulations of the public replicable.

        I think it would be helpful to try to publicize more broadly the philosophy of science point that the social sciences, even at their best, aren’t always as replicable as the physical sciences.

        In particular, studies of how to manipulate people might well tend to have a relatively short shelf life. The gravitational waves finding announced yesterday ought to replicate in 35 years, or something is very wrong. In contrast, the success of 1981’s Bill Cosby ad won’t necessarily replicate in 2016 no matter how scientifically its success was measured in 1981. College students being primed into walking slow in 1996, even if the effect existed (which, granted, it probably didn’t), might have just been a fad that seemed cool at the time, like dancing the Macarena.

        This would provide a second reason not to be so credulous: maybe the scientists are manipulating their analyses, or maybe they did everything exactly right but just happened to discover a fad.

        Both reasons point toward replication being necessary before we put these results into the textbooks.

        • Also, the “maybe it was just a fad” argument makes critiques of the scientists less of a personal attack on their characters and gives them an out: okay, the effect I found seems to have worn off.

          Granted, maybe we need more personal attacks on scientists, but offering them a line of retreat like this may be helpful overall.

        • Steve:

          Sure, but not just that. I don’t see any convincing evidence that Bargh’s interventions ever worked, even at the time. The data seem consistent with zero effect (or a small positive effect or a small negative effect) plus creativity in data analysis.

          To put it another way, I suspect that had Bargh reported the exact opposite result—priming students with elderly-related words makes them walk faster (a reasonable hypothesis, in that the priming could subtly remind the students of their youth)—everything would have gone the same. Same Gladwell treatment, same positive message for the business audience, same 300 follow-up papers finding results consistent with “priming with elderly-related words makes people walk faster,” same insistence by Daniel Kahneman that we have no choice but to believe these findings, same failed replication a couple decades later, same comments by Hal Pashler, E. J. Wagenmakers, and myself saying we don’t believe it.

        • Rahul:

          All three authors of the power pose paper teach at business schools. But marketers want real science too. From a marketing point of view, the most impressive thing about power pose is not the empty experiments that demonstrate nothing; it’s the success with which the idea of power pose has been marketed to Ted, NPR, etc. That’s the case study worth studying.

        • Against the backdrop of the other crap people can be made to believe (e.g. homeopathy, Reiki, weight loss fad diets) that’s probably a non-story to the hardcore marketers.

    • “I fear this will become more and more prevalent in the years to come. As replications of famous findings fail, the researchers in the field will perform new under-powered, p-hacked studies like they are used to and will no doubt find all kinds of moderators and who knows what. Then a new rigorous replication will focus on these moderators and will find nothing again. Etc. etc.”

      Yep, exactly in line with what “super star” social psychologist Ap Dijksterhuis seems to propose. He does this after another one of his studies failed to replicate. If I am not mistaken, he and his “expertise” seem to have even been part of the development of the replication protocol!

      https://www.psychologicalscience.org/redesign/wp-content/uploads/2017/11/RRR_ProfPrime_Ms_171013_ACPT.pdf

      “Dijksterhuis and van Knippenberg (1998) reported that participants primed with an intelligent category (“professor”) subsequently performed 13.1% better on a trivia test than participants primed with an unintelligent category (“soccer hooligans”). Two unpublished replications of this study by the original authors, designed to verify the appropriate testing procedures, observed a smaller difference between conditions (2-3%) as well as a gender difference: men showed the effect (9.3% and 7.6%) but women did not (0.3% and -0.3%). The procedure used in those replications served as the basis for this multi-lab Registered Replication Report (RRR). A total of 40 laboratories collected data for this project, with 23 laboratories meeting all inclusion criteria. Here we report the meta-analytic result of those 23 direct replications (total N = 4,493) of the updated version of the original study, examining the difference between priming with professor and hooligan on a 30-item general knowledge trivia task (a supplementary analysis reports results with all 40 labs, N = 6,454). We observed no overall difference in trivia performance between participants primed with professor and those primed with hooligan (0.14%) and no moderation by gender.”

      Here is the reply by Dijksterhuis:

      https://www.psychologicalscience.org/redesign/wp-content/uploads/2017/11/Dijksterhuis_RRRcommentary_ACPT.pdf

      “A cumulative research program by one or two individual labs may yield much more diagnostic results”.

      Just as predicted! The entire cycle of low-power, p-hacking, and publication bias can start all over again. From Meehl (1990): “As I put it in a previous paper on this subject (1978), theories in the “soft areas” of psychology have a fate like Douglas MacArthur said of what happens to old generals, “They never die, they just slowly fade away.””

      • “Yep, exactly in line with what “super star” social psychologist Ap Dijksterhuis seems to propose”

        Perhaps not only propose, but already performed!

        If I am understanding everything correctly, it looks like he performed 2 studies to “verify the appropriate testing procedures”, that were not published, probably under-powered, and possibly p-hacked.

        This led to the “insight” of gender being a possible moderator, which was not confirmed in the highly powered replication.

        After which, Dijksterhuis comes up with yet another possible moderator.

        So, in my reasoning Dijksterhuis himself already refuted that “A cumulative research program by one or two individual labs may yield much more diagnostic results”.

        That may be his biggest contribution to the field in my opinion.

        • Scientific American says that the Pythagoreans treated odd numbers as masculine (7th paragraph):

          http://www.livescience.com/15859-odd-numbers-male-female.html

          The same claim is made from a course web site at Dartmouth:

          https://www.dartmouth.edu/~matc/math5.geometry/unit3/unit3.html

          I love the reasoning: “Odd numbers were considered masculine; even numbers feminine because they are weaker than the odd. When divided they have, unlike the odd, nothing in the center. Further, the odds are the master, because odd + even always give odd. And two evens can never produce an odd, while two odds produce an even.”

          As Andrew often says, it could be true for some people in some situations but not true for other people in other situations. See how quickly we start playing whack-a-mole?

        • Andrew, according to Wikipedia it’s the opposite; the Pythagoreans believed odd numbers are male and evens are female. Can Wilkie and Bodenhausen count this as a pre-registration?

        • That’s a good point. What if someone produces a hypothesis they never preregistered after seeing the data, and then they find out that 100 years ago someone had the same hypothesis? Does that count as preregistration?

          You can have endless fun with preregistration:

          http://www.bayesianphilosophy.com/a-modest-proposal-to-fix-science/

          Historians of science are going to look back on our time as the “Alice in Wonderland” period of science. They’ll wonder about us the same way we wonder about time periods where they burned witches.

        • To make this concrete, here is how Wikipedia describes the method Kepler used to discover the elliptical nature of planetary orbits:

          He then set about calculating the entire orbit of Mars, using the geometrical rate law and assuming an egg-shaped ovoid orbit. After approximately 40 failed attempts, in early 1605 he at last hit upon the idea of an ellipse, …

          This is the worst kind of data dredging on observational data imaginable. It’s the Garden of Forked Paths on steroids. But the rest of the sentence is:

          …which he had previously assumed to be too simple a solution for earlier astronomers to have overlooked

          But what if it wasn’t overlooked? What if it was considered by the ancients, but they made a mistake, or didn’t have accurate enough data, or did discover it, but this solution was lost like most of ancient science was?

          Would this make Kepler’s post hoc hypothesizing valid suddenly? If it’s still invalid, then how do you explain the fact that this was a pivotal discovery in the history of science? Kepler’s style of approach wasn’t fundamentally different than that used to decipher the double helix nature of DNA.

        • Laplace, I tried to make a post on your blog the other day but it never showed up. I’m curious as to whether you chose not to post it for whatever reason or there was some technical issue.

      • After a similarly skeptical discussion on Twitter a year ago, I replicated this study with my students some months ago. Found very similar main effects, as I remember. Just because you don’t think it’s plausible or relevant doesn’t mean it’s not robust.

        • I do think the topic of the investigation is kind of weird but maybe I am just not creative enough to recognize situations where someone would care about this kind of work. I did not mention it in my first comment, but my main concern with the original study is that the reports of the effect seem “too robust”. All the experiments in Wilkie & Bodenhausen worked, but oftentimes just barely.

          Exp. 1A: the gender-number effect is significant with p=.04

          Exp. 1B: the gender-number effect is significant with p=.046

          Exp. 2: the gender-number effect is significant with p=.03

          Exp. 3: the gender-number effect is significant with p=.016

          MTurk experiment described in Conclusions: the gender-number effect is significant with p=.03

          Maybe there really is an effect here (and maybe it is important for some scientific theories), but the studies in Wilkie & Bodenhausen are odd because with the kind of studies they reported one would expect to see some non-significant results, even if there really was an effect. The absence of non-significant studies suggests some kind of publication bias or a garden of forking paths kind of situation. For such studies, I do not trust the reported results; at best they overestimate the true effect and at worst they can show an effect when none exists at all. I have more faith in your single experiment than in the multiple findings reported by Wilkie & Bodenhausen. Have you published your result?
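
          To put a rough number on that intuition: if the five studies were independent and each had a fixed chance (its power) of reaching p < .05, the chance of a clean sweep is the product of those powers. The power values below are illustrative assumptions, not estimates from the paper:

          ```python
          # If each of five independent studies had probability `power` of reaching p < .05,
          # the chance that all five succeed is power**5.
          for power in (0.3, 0.5, 0.8):
              print(f"power = {power:.0%}: P(all 5 studies significant) = {power**5:.1%}")
          # power = 30%: P(all 5 studies significant) = 0.2%
          # power = 50%: P(all 5 studies significant) = 3.1%
          # power = 80%: P(all 5 studies significant) = 32.8%
          ```

          Even at a quite respectable 80% power per study, five out of five significant results is the less likely outcome, which is what makes the uniformly just-under-.05 pattern look selected.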

  4. I have a question that’s kind of unrelated, though this is maybe the best place to ask it. The “garden of forking paths” problem is intuitive to me, but what I don’t understand is why a similar problem doesn’t occur when you do Bayesian inference with model checking and model expansion. This is how the process is supposed to work, if I’m understanding it correctly. First you build a model and fit it to your data. Then you sample virtual data from the fitted model and you check whether the actual data fits (i.e. is a typical realization of) the distribution of the virtual data. If the fit is bad, you note in what ways the fit is bad, and then — conditional on what the data told you — you build a new and more complex model designed to make the fit better. Then you do the same procedure with the more complex model, and you keep going until you have a model that shows a satisfactory fit with the data. Every step of the way here your modeling choices are conditional on the outcome of the predictive checking, all of which is done on the same data set. That seems like a problem; why does it not lead to overfitting? Note that I am not saying that the problem is that the data set is used twice when you do the predictive checking (I know you’ve gotten that objection before); the problem is that your construction of the new and more complex model is conditional (in a forking-paths way) on the outcome of the predictive checking, and the new and more complex model is then checked on the very same data set again, and this procedure is iterated. I’m very sorry if I have mischaracterized how posterior predictive model checking works!
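
    To make the procedure concrete, here is a minimal sketch of one round of the check being described, with a toy normal model fit to deliberately skewed data (the model, the test statistic, and the numbers are all made up for illustration and are not tied to any analysis discussed here):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # the "real" data: skewed
    n = len(y)

    # Step 1: fit a (deliberately wrong) normal model. With a flat prior the posterior for
    # mu is approximately Normal(ybar, s/sqrt(n)); sigma is held at the sample sd for brevity.
    ybar, s = y.mean(), y.std(ddof=1)

    # Step 2: simulate replicated datasets from the fitted model.
    n_rep = 1000
    mu_draws = rng.normal(ybar, s / np.sqrt(n), size=n_rep)     # posterior draws for mu
    y_rep = rng.normal(mu_draws[:, None], s, size=(n_rep, n))   # one replicated dataset per draw

    # Step 3: compare a test statistic on the real data with its replicated distribution.
    def T(data):
        return np.max(data, axis=-1)    # the sample maximum, which is sensitive to the skew

    ppp = np.mean(T(y_rep) >= T(y))     # posterior predictive p-value
    print(f"T(y) = {T(y):.2f}, posterior predictive p = {ppp:.3f}")
    # A p-value near 0 or 1 flags misfit; the next round of the loop would expand the model
    # (say, to a lognormal) and run the same check on the same data.
    ```

    The worry raised in the question is precisely that the expansion step is chosen after seeing which checks fail, and the expanded model is then checked against the same data.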

    • Olav:

      Yes, it’s an issue. There’s a back-and-forth between the statistical model checking and the substantive theorizing. A big difference is that I’m not doing “NHST”; that is, I’m not taking rejection of null hypothesis A as evidence of substantive hypothesis B.

      • Thanks for the reply. It’s a fascinating procedure to think about. The obvious fix to the forking-paths/overfitting problem with posterior predictive checking would seem to be to hold out data, i.e. to divide the data into several separate “training” sets, one for each predictive checking/model expansion iteration. But the problem is that the procedure is open-ended — that is, the number of iterations can’t be known in advance — so it’s unclear how much data should be held out. And, besides, holding out data seems decidedly unBayesian.

        • When data have a hierarchical structure, there is also the problem of dividing the data in a way that preserves the hierarchical structure.
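
          One simple way to respect that structure, sketched here for the toy case of observations nested in groups (the data and numbers are hypothetical, not from any model discussed above), is to hold out entire groups rather than individual observations:

          ```python
          import numpy as np

          rng = np.random.default_rng(2)

          # Toy hierarchical data: observations nested within groups (e.g., subjects).
          group = np.repeat(np.arange(20), 10)     # 20 groups, 10 observations each
          y = rng.normal(size=group.size)

          # Hold out whole groups, so the held-out set preserves the hierarchy and the
          # model never sees partial information about a group it will be checked on.
          held_out = rng.choice(np.unique(group), size=5, replace=False)
          test_mask = np.isin(group, held_out)

          y_train, y_test = y[~test_mask], y[test_mask]
          print(f"train: {y_train.size} obs from 15 groups; test: {y_test.size} obs from 5 groups")
          ```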

  5. “OK, I think we can just pass by the replication-by-TV-show argument in polite silence.”

    “And, no, I don’t think replications on Youtube count for much.”

    I agree with the point of your post, and certainly the criticism of these sorts of studies. I’ll point out, though, that your dismissal of replication-by-TV-show, or YouTube, sounds just like others’ criticism of your blog posts for being non-peer-reviewed and therefore not part of “real” scientific discourse. You’ve (justifiably) complained about these criticisms in the past. If you believe that we should assess scientific communication by its actual quality and not its medium, impact factor, etc., then this needs to apply uniformly — i.e. you need to write what you think is done badly in the BBC experiment, and not dismiss it because it’s on TV. (I haven’t seen the show, nor do I care to. I have, however, had my students extract very good data from a Mythbusters episode to infer things about thermodynamics!)

    • Raghuveer:

      Fair enough. If the video comes with a full description of the experiment, then it’s as good as a research paper. I took a look at the YouTube video in question and I’d say it’s not a research contribution in that way. Rather, if you already accept the effect as real, the video is an excellent demonstration of it. But it does not at all represent independent evidence. For one thing, it looked like there were 4 people who were experimented on in the video. N=4 or so can be fine for qualitative insight, but given the variation in this sort of data—even the variation given in Bargh’s own papers—you can’t learn anything useful from N=4 about any sort of average effect.

      So the issue is not Youtube, the issue is that it’s not a replication. It’s a demonstration, which is something entirely different. In a demonstration you want to find the effect, it’s even ok to rig things a bit, no need for researcher blindness, anything like that. Classroom or public demonstrations are fine—Deb Nolan and I have a whole book full of them—but we shouldn’t confuse them with replications.

  6. It would be fun to get a TV show with a big budget, like “60 Minutes” or John Stossel, to film an attempted replication.

    TV could help with a big problem in public comprehension of “statistical significance:” Something that most people don’t realize is how trivial the effect Bargh reported was — something like primed students walked down the hall 6/7ths as fast as unprimed students. They read that it was “statistically significant” and that seems even more important than “significant,” when if you showed video of students walking down the hall in 6 seconds and 7 seconds, the difference would seem insignificant.

    Now that I think about it, you probably wouldn’t need a 60 Minutes sized budget. You could get a few film students up from NYU to set up the lighting and videocameras and do it cheap and put it on Youtube.

    • Steve:

      I don’t recommend this for the same reason that I don’t recommend people bother with preregistered replications of such studies. Effect sizes are so small and variation is so large that if you do it right you won’t find anything. I could be wrong, of course, but given the evidence I’ve seen, that’s what I’d guess. The BBC video that shows an effect, even though with N=4 it’s all random, is if anything a demonstration of how, in a poorly controlled experiment, you can find what you’re looking for.

      Effective TV is believing that you know the answer ahead of time and then demonstrating it on 4 people. Boring TV is having an open mind and then getting data that show no clear pattern.

      • Has anybody yet tried to make a video demonstrating how the public is having their chains yanked by priming experiments? Use split screen techniques, multi-images, graphics, etc. Put it up on Youtube and try to get professors to show it in class.

  7. I find this blog post to be quite lazy. Has the author read Joe Cesario’s Perspectives on Psychological Science 2014 article on replication in priming studies? Cesario takes the issue of replication of these studies quite seriously. To call his approach a whack-a-mole approach is strident and to my eye simply not paying attention to the thought that has gone into the work of those who have taken the issue of moderators in priming effects seriously. Better scholarship is possible. Simply dismissing thoughtful work in such uncharitable ways is a huge part of the problem, in my opinion.

    • Steve:

      Cesario can take the topic as seriously as he’d like, but if the signal is weak and highly variable and the measurements are noisy, it just doesn’t matter. I’m sure Bem takes ESP very seriously too. And in the physics department they could be very seriously working on cold fusion. Seriousness is great but it’s no substitute for having a large and stable effect to study.

      Regarding the moderators: Given the failed replications of the Bargh studies, perhaps we can all agree that that 1996 paper was wrong, that they just found a statistically significant result from chance, the same way that Nosek et al. did in their first “50 shades of gray” study (before they went to the trouble to do their own preregistered replication and found to their dismay that their discovery did not replicate). But you’re claiming that the later study by Cesario is for real. Maybe so. The results are also consistent with there being nothing going on, but with creative researchers managing to find statistical significance in study after study by excluding data, looking at subgroups, looking at interactions, etc.

      I’ll outsource the psychology research here to Pashler et al. My expertise is not in psychology but in statistics. And as a statistician I can assure you that statistical significance in a series of open-ended research studies tells us just about nothing.

      Again, I’m happy to accept that Bargh, Cesario, etc., are thoughtful. But thoughtful ain’t enough. That, in some sense, is why the whole field of statistics is necessary: thoughtful, creative experimenters without good statistical guidance can and will find patterns from noise.

        • Rahul:

          I don’t think most psych professors take priming seriously. Or, to put it another way, all of us, psychologists included, accept that priming exists. But I don’t think most psych professors take seriously the studies claiming large and consistent effects from the sort of super-subtle primes used in many of these studies.

          Real priming—you give people information or put them in a certain setting and then they’ll think differently—that definitely exists. The work of Bargh etc. is almost perverse in that they set up situations where the priming is so subtle that it won’t work, then they start looking for it.

          But, sure, the cold fusion analogy is a bit over the top.

      • P.S. to my response above to Steve: Let me just quote from Bargh’s post:

        Both articles found the effect but with moderation by a second factor: Hull et al. 2002 showed the effect mainly for individuals high in self consciousness, and Cesario et al. 2006 showed the effect mainly for individuals who like (versus dislike) the elderly.

        This is exactly what you’d expect to see from highly motivated researchers studying a null effect. A replication fails but it is salvaged by coming up with an interaction that is statistically significant. Another replication fails but it is salvaged by coming up with a different interaction. It makes you wonder why, if these interactions made so much sense, they weren’t examined in the original Bargh et al. study. Or, for that matter, why interaction A wasn’t examined in study B, or interaction B examined in study A.

        I’m not suggesting any malign behavior on the part of these researchers: rather, I’m suggesting that they’re doing standard practice, which is to go through your data and find interactions, then declare victory when you attain statistical significance. As Bargh writes, these studies appeared in “the top and most rigorously reviewed journal in the field.” So I can hardly blame them for following what were considered at the time to be best practices. But I can blame them for not reassessing in the light of statistical theory explaining how these formerly-standard practices can lead to finding statistical significance from noise, and for not reassessing in the light of empirical evidence from failed outside replications. At some point, ignorance is no excuse.

        • Maybe Andrew should stress this point in his critiques. If you find a new moderator (this applies to the red color study too) then being able to consistently find it in subsequent replications strengthens our belief that the effect may be there.

          Andrew, you say in response to Steve that there is “no substitute for having a large and stable effect to study”. You have however pointed out in other threads that effects can be small but worth studying (because they can be important for the theory). I think that one should only say that there is no substitute for having a stable effect to study. Size doesn’t matter, or only matters depending on context. If these priming guys find a small but stable effect, that’s informative as far as the study of cognition for its own sake goes, it may have no real world implications. The same might hold for Cuddy’s work. In medicine I can see that size really starts to matter; a treatment that prolongs life for a short time (but causes a lot of suffering, say) may not be worth deploying no matter whether there is a “statistically significant effect” of survival or not.

          To me Bem seems to fall in a different category; and so does the beautiful couples have (what was it?) more girls stuff. That’s just the type of thing the Berliner Zeitung headline writer thinks up.

        • Shravan wrote: “If these priming guys find a small but stable effect, that’s informative as far as the study of cognition for its own sake goes, it may have no real world implications. The same might hold for Cuddy’s work.”

          I think these two examples (priming — at least special cases — and Cuddy’s work) may indeed have real world implications (albeit not as severe as the medical example you give).

          Priming: One special case is the “stereotype threat” situation, where a priming intervention concerning race or gender is asserted to improve performance on a difficult math test. These have been promoted as the basis for academic interventions to improve minority performance or self esteem. But if there is indeed a small, even though stable, effect (or no stable effect), such interventions may be a waste of time and effort, especially if more effective interventions are possible.

          Cuddy’s work: It seems (just judging from the popularization of Cuddy’s work — TED talk, etc. — and her “defense” of it) that belief in the “power pose” effect can promote false feelings of confidence that lead to poor performance.

        • Yeah, I agree with your comments.

          In fact, your comment reminds me about a peculiar thing about Cuddy’s claims; how do they relate to the finding that the less competent you actually are, the more you think you are competent. “Unskilled and unaware of it” was the article I once read, I think. I see this a lot in beginning students; they sometimes come in with an imagined capability to do research, and their self-confidence is often inversely proportional to their actual ability. The more competent students tend to be more circumspect about their own abilities. Cuddy’s slogan, “fake it till you make it” (Ted talk) seems to encourage false confidence, and might exaggerate even more the over-confidence that is so damaging to people starting out in acquiring new skills. A little bit of self-critical ability (maybe a lot of it), and a good understanding of one’s own limitations, is needed to make serious progress.

          The paper is here: http://psych.colorado.edu/~vanboven/teaching/p7536_heurbias/p7536_readings/kruger_dunning.pdf

        • Actually, Cuddy herself would benefit from reading the paper I linked to. I don’t think she really understands the depths of her own ignorance about what the statistical analyses tell her and don’t tell her about power posing.

        • @Shravan

          Yes, the “unskilled and unaware of it” hypothesis does agree with my own observations of students as well.

          Thanks for the link to the Kruger-Dunning paper; I don’t recall seeing it before. I’ve glanced at it, but with my usual critical eye, I did notice what seems likely to be a statistical overreach: Toward the end of p. 1123, they say, “were also marginally higher than a ranking of ‘average’ (i.e., the 50th percentile), one-sample t(15) = 1.96, p < .07.” That sounds like they are confusing “marginal statistical significance” (which is an iffy concept anyhow, particularly in a study with multiple comparisons) with small effect size.

        • This is not what happened in this case, and anyone who knows the literature understands that. Higgins’s model predicted long before this study that activation is moderated by individual differences and motivational states. That is what he was studying for at least 20 years. To someone who knows the field, it would be expected that Higgins and his students would think that priming is moderated by individual differences and motivational states. This simply is not a case of someone ferreting in the data to find a moderator to justify the failure to find a main effect. It is a theoretically derived prediction of moderation, and if you and the author of this blog had done your homework, or even had read thoughtfully the recent piece by Cesario in Perspectives, then you would know that is the case. I find this to be lazy scholarship on your part and that of the blog poster. You have to read the literature to know what is a theoretically derived prediction of moderation and what is a post-hoc justification, and it is important to make that distinction properly. This blog post and the commentary do not do so.

        • Maybe this is an insider vs outsider perspective difference.

          Have you discussed something like this with smart outsiders? Rather than lazy scholarship, this is what academic “inbreeding” gets you. It could serve as a useful reality check to bounce some of these studies & ideas you seem to be in awe of off some smart outsiders whose opinions & scholarship you respect.

        • I am not an insider. I am just a social psychologist who knows the literature. Knowing the literature can help you sort out whether a moderator is an a priori prediction based on previous theory or a post hoc attempt to redeem a failed attempt to get a main effect. In this case you can sort it out if you bother to review the literature or even read the Cesario, Plaks, and Higgins paper carefully. In my view, if you are going to make the charge that a moderator prediction was post hoc and an abuse of analyses, then you at least need to do your homework and see if the moderator would have been predicted by the new authors’ theory, and in this case there is a long history that it would have been. The authors claim this prediction was based on their theorizing and that fits with the literature. This should at least be noted. It is at least plausible that the results were predicted a priori and that could have been known by the blog author. We can never know for sure if that was the case, but before we jump to the conclusion that it was post hoc, we should at least consider the evidence that it was predicted a priori. To not do so is lazy scholarship in my view, and less lazy scholarship in this case would have made it clear that it is at least plausible, and I would argue highly probable, that the results of the Cesario, Plaks, and Higgins study were predicted a priori. I have discussed this paper with Cognitive Psychology colleagues skeptical of priming, and once the relevant theories have been explained, they have agreed that it is likely the results of the Cesario, Plaks, and Higgins study were predicted a priori. Have you as an outsider done your homework and examined the theories in question before coming to a conclusion about whether it was post hoc or a priori?

        • Steve Spencer:

          Thanks for continuing the discussion. It is helpful to explore these issues.

          You write, “In my view, if you are going to make the charge that a moderator prediction was post hoc and an abuse of analyses. . .”: I didn’t “make any charge.” It’s the opposite. Papers such as the one you cite get published after getting p less than .05, and these p-values are conditional on the assumption that the particular set of analyses that was reported in the paper is exactly what would’ve been done and reported, had the data been different. But I have no reason to believe this. The authors provide no evidence that they would’ve done the exact same data processing, analysis, and reporting had the data been different, and in their paper they did not make such a claim. And such a claim—which, I repeat, the authors did not make in their paper—is highly implausible. What if they’d found a statistically significant main effect? Of course they would’ve reported that. How come different papers in this same literature look at different interactions? How come, if the particular interaction in the paper you cite is so important, it was not included in various other papers in this field, including for that matter the original Bargh et al. paper?

          Are there any grounds for considering the conclusions of Bargh et al. (1996), which did not include this interaction, to be valid, but to consider the later failed replication by Pashler et al., which also did not include this interaction, to be invalid?

          Again, the statement that “activation is moderated by individual differences and motivational states” admits essentially infinitely many choices in data processing, analysis, and presentation. I don’t disagree with your claim that the results in the study you point to “were a priori predicted.” I doubt they would’ve done their experiment in the first place had they not predicted success. The trouble is that a prediction of success can map to many, many, many data patterns, enough so that there would be no difficulty finding statistical significance, even in the absence of any true effect or, more to the point, in the presence of true effects that are so small and so highly variable that any statistical evidence from studies of this size is meaningless.

        • Steve, as a fellow social psychologist my view is that almost any post hoc finding can find a plausible a-priori-seeming justification. Norbert Kerr pointed this out back in the 20th century and it is easy to illustrate this with students. Give them an arbitrary list of psychological variables, ask them to select three at random (A, B, C), and then ask them to come up with a theoretically plausible reason why A might moderate the relationship between B and C. I’ve done this quite a few times and students can come up with something plausible for about 80% of cases – and that’s when just sitting in class rather than when truly motivated to find something to get a manuscript accepted at JPSP.

        • Mark, while it is true that almost any finding can find a plausible explanation that could be framed as a priori, researchers cannot make up prior theoretical papers that argue for that position. When such papers are there in the literature, this should be taken into account in judging whether a given prediction is a priori or post hoc. In this case the papers were there in the literature arguing for exactly the type of interaction that Cesario, Plaks, and Higgins tested. To me that is strong evidence that the interaction prediction was a priori, and careful scholarship should investigate whether such theoretical papers exist and should take the existence of such papers into account when trying to judge whether a particular prediction is a priori or post hoc. This was not done in this case. The blog doesn’t even recognize that Bargh’s writing on priming is substantially different from Higgins’, and seems to expect that despite their different theories about priming they would make the same predictions in every case. In response, I am just trying to say, “read the theories before you make a judgment about whether a prediction is a priori or post hoc.” I think if you do you will see that both the Bargh prediction and the Cesario, Plaks, and Higgins prediction were a priori.

        • Steve:

          The prediction in the Pashler et al. attempted replication of Bargh et al. was definitely a priori, and we know how that turned out. Regarding Cesario et al., let me repeat that the general theory (varying effects) was a priori, but the particular analysis was not (see Erikson’s comment elsewhere in this thread). And, again, Cesario et al. in their paper never claimed to have made their data processing, analysis, and presentation decisions ahead of time. Their analysis was contingent on the data. There should not be anything controversial about this, and it’s not an accusation to point it out. The only concern is that it makes the p-values essentially meaningless—but this is a point that until recently was not widely understood.

          Recall also that Bargh in his post claimed the Cesario et al. result as a successful replication of his finding, even though Bargh’s main effect did not (I assume) appear in Cesario et al.’s experiment, nor did Cesario et al.’s interaction appear in Bargh’s experiment. To me, the sad part of all this is seeing researchers jerked around like puppets on a string based on random patterns from N=40 experiments. One could spend an entire career doing this. As a statistician, I feel it’s my duty to explain how this can happen, so future researchers don’t make the same mistake. It’s easier to communicate this to people who don’t have a personal or professional stake in the old system. (Recall the devaluation-of-the-currency analogy.)

        • To me, “a social psychologist who knows the literature” is by definition an insider.

          My suggestion was that you find a smart non-Soc. Psy. researcher you respect & bounce some of these studies off them.

        • Steve Spencer:

          “Activation is moderated by individual differences and motivational states” is a statement so general as to be close to meaningless.

          The point is that different published studies in this area look at different interactions. The original Bargh, Chen, and Burrows paper from 1996 looked at no interactions at all! Why do you believe their N=30 positive result (despite its statistically significant p-value being subject to selection effects and thus not interpretable at face value) and not the N=66 failed replication?

          And, if these interactions are so important and so theoretically motivated, why did Bargh, Chen, and Burrows not include these interactions in their own study? And if the interaction with factor A was so theoretically motivated for study A, why was that interaction not included in study B? And why was interaction B not included in study A? Why did not all these studies include both interactions in their analyses? And what about other theoretically motivated interactions such as sex, age, socioeconomic status, etc?

          The problem—and your comment illustrates this—is that the substantive theory is so general that it admits endless numbers of statistical hypotheses, any of which you can take as confirmation of the general theory. That’s fine—but then any claims of statistical significance in these sorts of studies are meaningless. Researchers have, in the words of Uri Simonsohn, essentially infinite “degrees of freedom” or permutations of the data that will count as statistically significant.

          The term “whack-a-mole” is a perfect description of what is happening here. No study in this area ever needs to fail. Miraculously, studies of tiny effects with complicated interactions keep finding statistically significant results. Why? Because statistical significance is the researchers’ goal, and they have enough degrees of freedom (choices of how to characterize and exclude data, which comparisons and interactions to look at, and so forth) that they can regularly attain statistical significance—and all this is completely consistent with the wonderfully vague theory that “activation is moderated by individual differences and motivational states.”

          Again, I’m not suggesting any malign behavior on the part of these researchers: rather, I’m suggesting that they’ve been playing by the rules. It’s just that the rules don’t work the way you think:

          You think that statistical significance is a tool for discovery: first Bargh et al. discovered the main effect, then follow-up researchers discovered important interactions, then various failed replications actually represent new interactions that had not been recognized by the experimenters, and so on. An endless parade of discovery. Actually, though, this entire pattern can occur from pure noise, as explained by the statistical work of Simonsohn, Francis, and others, and as demonstrated by various unsuccessful preregistered replications.

          Given your remarks about “lazy scholarship” and “anyone who knows the literature,” I can’t imagine I’m convincing you of anything. I’m basically coming to you and telling you that all that currency you’ve been spending and collecting is not actually backed by gold: it’s paper money and the government backing it is collapsing.

          But I do hope to be reaching the younger researchers who would like to do better. When I give talks on this material to psychologists, students and postdocs come up to me and ask what they should be doing. They are often interested in reforms of the process of publication and diffusion of science.

          Moving away from the paradigm of routine discovery: that’s scary.

        • You seem to not understand that Bargh and his colleagues have their point of view (priming is often a main effect; see his cognitive monster theoretical paper), and Higgins and his colleagues have their point of view (what is important about priming is that it varies with motivation and individual differences), and they have specific ideas about which motivations and individual differences matter. Each is testing their own previously articulated ideas. I see no evidence there was any use of endless researcher degrees of freedom here. You clearly advance the hypothesis that the Cesario, Plaks, and Higgins finding was post hoc. I say the evidence in the literature is that it was a priori. That of course matters. They were not playing whack-a-mole; they were testing their theory, and that could have been known. And can’t you see that if they were across the table from you and you said, “in that paper you were just playing whack-a-mole,” that might be seen as uncharitable and condescending? That is the way it looks to this outsider. You can raise your point about experimenter degrees of freedom, but the charges about individual papers are misguided. Personally I remain unconvinced by the main effect interpretation of most subtle priming effects (less so of more overt priming effects), but I think that interaction priming effects are much more likely to be robust. We will see in time, but dismissing all attempts at moderation as a whack-a-mole approach I think is unfair and does not pay enough attention to the theories that have been developed.

        • Spencer’s defense of Cesario et al. (2006) is not justified, in my opinion. A close reading of the analysis clearly points to use of researcher’s degrees of freedom and p-value hunting.

          In experiment 1:
          “Gender had no main effect or interactions […], so it is not discussed further” (p. 897). The “theory” didn’t take into account possible gender effects. They also didn’t find any significant gender main effect or interaction, so they set it to zero. But it means that, even if the “theory” didn’t predict it, they checked for it and found nothing — possibly sighing in relief. If they had found some effect, they would probably have built a compelling narrative about it. Shouldn’t gay men elicit a more hostile reaction in straight men?

          The first analysis is odd. The omnibus ANOVA did not show a significant effect for prime condition, but the analysis did not include any other possible confounder. They could have stopped here and concluded “no effect”, but they state that they found significant differences in “planned contrasts”. The measures are highly skewed, as is usual with psychological scales, but it is not taken into account, also as usual. Even so, the first contrast (control vs gay prime) is not significant (p = 0.053), even though they report it as 0.05. The other contrast (straight vs gay prime) is significant, but how do you make sense out of those results?

          In experiment 2:

          Again, omnibus ANCOVA of walking time on prime condition with entrance walking time as covariate was not significant (or “near-significant”). Some degrees of freedom in F tests are off by one, which means that the fitted model is not exactly the one implied in the text. This is also common in some psychology research: models are not made explicit, only the statistical tests. I guess it doesn’t help our lack of mathematical training. The size of the effect is quite small (a little more than half a second), so we can only hope that the person who measured the time has little error in pressing the chronometer button.

          For the regression models, they claim that the measures of implicit attitude are orthogonal, but how can two measures of positive and negative attitude be independent? And why put a graph of regression lines without the points, especially with such a small sample? They fit two smaller models (one for each priming condition) and one full model, but they change the predictors for the full model, especially the implicit attitude measure and group dummy variable. What justifies such choices?

          In experiment 3:

          The predicted interaction was, once again, “near significant” (p = 0.065). So they searched for more using pairwise comparisons, and found one that is significant under a one-tailed p-value (p = 0.04), which means it is really 0.08 in a two-tailed test.

          Well, I guess this is enough evidence that they were hunting for significance, and found it no matter what.

          All this discussion reminds me of this blog post:
          https://hardsci.wordpress.com/2016/02/11/an-eye-popping-ethnography-of-three-infant-cognition-labs/
          which is about this paper:
          http://srd.sagepub.com/content/2/2378023115625071

          In summary: small samples, noisy measures, multiple comparisons, and statistical significance as the gold standard are a great recipe for bad science.

        • Erikson,

          I am not arguing that all was as it should have been in 2006 in the editorial process–it was not and there was too much emphasis on statistical significance and a threshold for publication in my view. Still if you are going to consider evidence for p-hacking or use of too many researcher degrees of freedom then you also have to consider evidence that argues against such an interpretation. I find it quite telling that you do not consider any evidence against this interpretation and only evidence in favour of it.

          So first let me note the evidence against p-hacking or too many researcher degrees of freedom. First, and importantly, it is easy to derive the predictions of this study from previous theorizing by Higgins that was published at least 10 years prior to this research. Second, several important findings are reported at near significant levels but were not hacked to p < .05. The omnibus test of priming in Study 1 and the interaction in Study 3 are places where a finding just under p = .05 would be a lot more suspicious. Third, it is clear that the covariate in Study 2 was planned ahead of time and collected before the manipulation. No other covariates are used in any of the studies. Again, opportunities for p-hacking not taken. Fourth, the planned contrasts do fit very well with long previously articulated theory.

          What is the evidence for p-hacking or use of researcher’s degrees of freedom? You say they analyzed for gender and found nothing and reported it. Well, at this time and in this journal it was typical for reviewers to ask if findings had been analyzed for gender. Some were quite concerned that people were not considering the possibility of gender differences in results and that something very easy to examine was not being examined. Said another way, this was typical practice. Perhaps not the best practice, but in my view no indication of p-hacking or use of researcher’s degrees of freedom. Rather, doing an analysis to address concerns of others in the field.

          That they conducted planned contrasts is in my view an explicit claim that these effects were predicted ahead of time. If these claims are legitimate then the analyses are legitimate. As I argued above, I think they are legitimate based on the theorizing that was in the literature. Of course .053 rounds to .05, and of course .053 was almost certainly also a rounded value. Now we can talk about the precision with which p values are reported, and I agree more precision could have been used, but at this point in the literature .05 was the typical precision reported. I see no reason to believe that the authors just doing what was the convention at the time is an indicator of researcher degrees of freedom.

          I have no idea why the degrees of freedom are off in Study 2. This is of course a problem and should have been fixed before publication. That it means researcher degrees of freedom were being used, however, is to me inconclusive at best. It could have been a transcription error, or the elimination of one participant for whom the data weren't collected (of course this should have been reported), or a whole host of other things besides researcher degrees of freedom.

          I do not understand why Erikson is surprised by the orthogonality of implicit attitudes toward the elderly and youth. They are not two measures of positive attitudes and negative attitudes, but rather measures of attitudes toward two different attitude objects–old people and young people. Is it really surprising that many people love both old people and young people, others hate both old people and young people, and it is about equally likely that others love one and hate the other? With regard to putting regression lines without the points, this again was typical of the day and less than optimal, but following the typical practice of reporting analyses is not evidence of researcher degrees of freedom in operation. I am also surprised that Erikson has any difficulty in understanding why the authors chose the variables that they did for the full regression analysis. Including the dummy variables was necessary when all the data were included and obviously would make no sense when selecting just one condition at a time. Including both positive and negative attitudes made sense when testing the smaller models without interactions, but there really wasn't enough n to test both dummies with the interaction of all four implicit measures if they kept negative and positive attitudes separate.

          I also see the weak evidence reported in Study 3 as evidence against use of researcher degrees of freedom.

          So, all in all, I see more evidence against use of researcher degrees of freedom than consistent with it, and I think the problem with only examining evidence consistent with use of such tactics (and not evidence against it) should be obvious. Such a biased search for evidence can really only lead to one conclusion.

        • Spencer,

          You are right, I have only cherry-picked problems while reading the paper. That’s a bad thing to do, for sure, given that there are no perfect papers and, as much as it has many problems, it certainly has some qualities, too. I also mentioned aspects of the paper that are problematic per se, not because they indicate abuse of researcher’s degrees of freedom (like criticizing the regression plot or the gender test).

          I’ll try to reframe my criticism in a less extreme way. But first, I have not claimed that the authors p-hacked. I said that they hunted for significance, using multiple comparisons and operating in gray areas to make the evidence look better for their theory. I don’t mean to imply that the authors are fabricating their results, only that they are abusing their degrees of freedom to better support their hypotheses.

          For experiment #1:

          Prediction: “[…] increased hostility following gay prime as compared with the straight and control primes.” (p. 897)

          Evidence: Non-significant ANOVA for main effects (p = 0.08);
          Non-significant (planned) contrast between gay and control primes (p = 0.053) – but reported as significant!
          Significant (planned) contrast between gay and straight primes (p = 0.047).

          Conclusion: “Study 1 provided evidence that priming social categories can lead to behavior opposite of those traits associated with the stereotype.” (p.898)

          The authors are concluding that two non-significant tests and only one significant comparison (which was not corrected for multiple comparisons, though it should have been) are evidence for the predicted hypothesis! They are not faking the results or p-hacking, indeed, but they are using qualifiers such as ‘near significant’ and misreporting (rounding) p-values to make it look as if the experiment provided evidence for the predicted effect.
          Also, why not stop at the non-significant ANOVA? Planned contrasts are usually applied when the omnibus ANOVA is significant.
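          The multiple-comparisons point can be made concrete with the simplest possible correction; here is a sketch using the two reported p-values (Bonferroni is just one choice, used only for illustration):

              # Bonferroni-adjusted threshold for the two planned contrasts reported above.
              alpha = 0.05
              p_values = {"gay vs. control": 0.053, "gay vs. straight": 0.047}
              threshold = alpha / len(p_values)              # 0.025 for two comparisons

              for contrast, p in p_values.items():
                  verdict = "significant" if p < threshold else "not significant"
                  print(f"{contrast}: p = {p} -> {verdict} at threshold {threshold}")
              # Neither contrast survives even this simple correction.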

          For experiment #2:

          Prediction: “[…] test the role of attitudes as a motivational underpinning in automatic behaviors” and “implicit measures […] will relate to automatic behaviors even though an explicit measure does not” (p. 899).

          Evidence: Non-significant ANCOVA for main effects (p = 0.06)
          Significant pairwise comparison between elderly and youth primes (p = 0.02) (not reported as planned)
          Significant coefficients in four small regression models.
          Two significant interactions in a full-regression model.

          Conclusion: “The predictions of Study 2 were confirmed” (p. 903).

          Indeed, things look better here. The main effect is not significant, which means that Bargh’s findings are not replicated. But then they compare the elderly and youth prime groups in another ANCOVA (why not use planned contrasts, as before?), discarding the control group information, to finally find some significance. They could have used a lot of other tools, too: difference-in-differences, simple ANOVA, etc. They are not cheating, but they are discarding information and re-running the analysis to obtain another result — a clear example of abuse of researcher degrees of freedom.

          The small regression models use the “orthogonal measures” separated by prime group. The results seem to be consistent with the main hypothesis. But the full model gives non-significant results. First, they define a new attitude measure by “subtracting each participant’s positivity score from the negativity score”. This means that Anew = Aneg – Apos, or so I understand it, which means that negative scores indicate a more positive attitude. But the authors claim the reverse! “Higher numbers indicating more positive attitudes” (p. 902). Then, they report the significant interaction without the main effects for comparison, and quickly indicate that “more important, tests of simple slopes […] were significant in the predicted direction” (p. 902). I assume “simple slopes” means main effect + interaction for each group, because the degrees of freedom remain constant in each test. But the only significant ‘simple slope’ is for increased time in the elderly prime group with higher attitude score; for the youth prime and youth attitude, the effect is not significant (p = 0.065), but it is interpreted as significant anyway. Why not report the full model in a table, instead of pointing out the significant coefficients?
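          For readers unfamiliar with the term, here is a rough sketch (Python, with simulated data purely for illustration, not the paper’s data) of what a “simple slope” amounts to in a model with a group-by-attitude interaction:

              # Simple slopes in the model  time ~ b0 + b1*attitude + b2*group + b3*attitude*group:
              # the slope of attitude is b1 in the group coded 0 and (b1 + b3) in the group coded 1.
              import numpy as np
              import pandas as pd
              import statsmodels.formula.api as smf

              rng = np.random.default_rng(0)
              n = 60
              group = rng.integers(0, 2, n)      # hypothetical coding: 0 = youth prime, 1 = elderly prime
              attitude = rng.normal(0, 1, n)
              time = 8 + 0.1*attitude + 0.2*group + 0.5*attitude*group + rng.normal(0, 1, n)
              data = pd.DataFrame({"time": time, "attitude": attitude, "group": group})

              fit = smf.ols("time ~ attitude * group", data=data).fit()
              b = fit.params
              print("simple slope, group 0:", b["attitude"])
              print("simple slope, group 1:", b["attitude"] + b["attitude:group"])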

          The “orthogonal measures” aren’t only for elderly and youth! There are four measures: positivity towards the elderly, positivity towards youth, negativity towards the elderly, negativity towards youth. I could accept an “orthogonal” measure of elderly and youth empathy, but how can those four measures be independent of each other? I understand that psychology has a lot of crazy scales for all sorts of things, but to assume orthogonality between such related constructs is not acceptable.

          The upshot is that the evidence in Study 2 is better than in Study 1, I’ll give you that, but there are so many unjustified analysis decisions that we can’t help but think that the authors ran different analyses until they found significance. And they did work with the gray area of ‘near-significance’ to make the data look more supportive of the predicted effects.

          For experiment #3:

          Prediction: “[…] priming ‘elderly’ activates a goal, […] writing about […] elderly represents fulfillment of that goal, which should lead to inhibition of the category accessibility” (p. 904)

          Evidence: Non-significant two-way interaction in ANOVA (p = 0.065).
          Non-significant post-hoc comparison (p = 0.08 two-tailed), reported as significant using a one-tailed test.

          Conclusion: “[…] after activation and subsequent satisfaction, accessibility decreased.” (p. 906)

          Again, the first test is negative, but they keep going anyway. No predicted contrasts are mentioned, and the comparison between groups is not a contrast within the bigger ANOVA model, but just a pairwise t-test between groups, reported as a one-tailed test. Each experiment used different analyses to compare groups, and now even a one-tailed test is used to report significance. But the authors made it look like the data supported their hypothesis, despite the lack of significance.

          The bottom line is that, although we can’t say that the authors p-hacked or intentionally twisted their results, they clearly have subpar evidence that they make look conclusive in favor of their theory. This is possible because they abuse the interpretation of the non-significant p-values, calling them ‘near significant’ when they favored the theory. It’s a clear case of abuse of researcher degrees of freedom, even if the hypotheses were not proposed post hoc.

        • What I find so odd about most of the priming (and power pose) literature is the insanely small sample sizes of many of the original studies. Who sets out to study a phenomenon that most social psychologists would have – at best – expected to have a small effect size – and designs a study with N=30 (or N=42 for Cuddy’s original paper)? There must have been some peeking at the data (i.e., p-hacking). Is this how the students of Bargh and Cuddy are being trained?

        • Mark:

          As I wrote in another recent post:

          you take a research area with small and highly variable effects, but where this is not well understood so you can get publications in top journals with statistically significant results . . . this creates very little incentive to do careful research. I mean, what’s the point? If there’s essentially nothing going on and you’re gonna have to p-hack your data anyway, why not just jump straight to the finish line. Chatterjee et al. could’ve spent 3 years collecting data on 1000 people, they still probably would’ve had to twist the data to get what they needed for publication.

          And that’s the other side of the coin. Very little incentive to do careful research, but a very big incentive to cheat or to be so sloppy with your data that maybe you can happen upon a statistically significant finding.

          Bad bad incentives + Researchers in a tough position with their careers = Bad situation.

        • Sorry Andrew, but I find this whole post pretty condescending. When have I said that statistical significance is a tool for discovery? I haven’t said that. I don’t believe it and it is not what I think, despite your ascribing it to me. I have been arguing for three major points: 1) there is a difference between a priori prediction and post hoc data analysis; 2) knowing the literature can allow people to judge at least to some extent whether a given analysis is a priori or post hoc; 3) to fail to do your homework on knowing the literature in evaluating an analysis, thus not fully considering whether it is a priori or post hoc, is lazy scholarship. I think all three of those points stand uncontested, and I think they indict your analysis in this blog as lazy scholarship.

          I would further add that these or similar distinctions are necessary to properly use your whack-a-mole analogy. I realize that researcher degrees of freedom can be horribly abused. I realize that our research practices, analysis practices, and interpretive practices were not and still are not all that they should be. I have read (and assigned in my graduate methods class last week) the Simmons, Nelson, & Simonsohn paper in which researcher degrees of freedom were able to create a ridiculous effect. I know these limitations. Still, I think it is important to distinguish between people who were testing their a priori hypotheses and followed the practices of the day, and those who churned their data set testing post hoc hypotheses, and maybe even churned the collection of data, to obtain some finding, whatever it may be, that they could develop a paper around. In my mind these are two very different ends of a continuum that characterized the practices in the field and in many respects still does. I further think that whack-a-mole can be used as an analogy for the latter end of the continuum but not the former. I would also add that with knowledge of the literature you can and should begin to place past research on this continuum, and with thoughtful scholarship the Cesario, Plaks, and Higgins paper would be placed on the non-whack-a-mole end of this continuum.

          In my view we have to be more discerning in evaluating the literature. We could simply say all the past research was shit and all of it was whack-a-mole and we need to start over entirely. I believe your analysis would lead to such a conclusion whether you have realized it or not. No paper, including everything published before 2011 and all but a very small handful published since then, would stand up to your analysis. They do not make the claims you are looking to see in the Cesario, Plaks, and Higgins paper. Everything would be seen as whack-a-mole. In my view you are using a sledgehammer in your critique when the careful brushes and hammer of an archeologist are what are needed.

          To throw away all the research that has been done in Social Psychology before 2011, and all but a handful of studies since then, would in my view be a huge mistake. Is there too much noise in the research? Certainly. Is there no signal at all? Almost certainly not. So as we sift through this research, those of us in the field are going to have to apply careful scholarship in trying to figure out what is more likely to be signal and what is more likely to be noise. That will include careful consideration of the theories that were being tested, for what is a priori and what is post hoc. It will require careful consideration not just of evidence that researcher degrees of freedom were implemented, but also of evidence that they were not implemented in some analyses. Sure, in almost every case the p values will be skewed to some extent, but is there useful signal in this research that should still be considered in the development of the field? In my view this is the tough and dirty work that needs to be done. Of course we could disregard everything, but in my view that too would be lazy scholarship.

          Andrew, what I find most offensive in this last post is the almost evangelical character to it. You are out to convince me and when you can’t imagine convincing me you dismiss me and move to convince younger researchers. There is no real engagement with my arguments. No real reconsideration of your point of view. No acknowledgement that knowing the literature would potentially be valuable in evaluating past papers. Instead I see evangelical zeal to convince people and when you don’t think you can you move on to convince the next person.

          I do believe in open dialogue and I hope we can continue that here, but perhaps this isn’t a place for open dialogue and instead is a place for recruitment to the cause. It is true that I am not yet convinced by your arguments in this blog, but I am trying to have an open mind and I do think critics of our practices have taught us many useful things. I know I have changed the practices in my lab. I just hope the critics, such as yourself Andrew, will be willing to listen as well. So far, I have not seen that attitude.

        • Steve, I agree with you that Andrew may be making too extreme a case here.

          However, I am very skeptical about your own understanding of the statistical issues. That’s because of my priors, which I explain below.

          I got curious about what your interest in priming was and why you were defending this line of work. Now, I’m not sure if you are the same Steve Spencer as at Waterloo, but if it’s really the same person, then your work on priming has the same kind of problem that Andrew keeps pointing out again and again. Even if this is some other Steve Spencer, the paper on priming has this weird stuff that raises hackles:

          Strahan, E., Spencer, S. J., & Zanna, M. P. (2002). Subliminal priming and persuasion: Striking while the iron is hot. Journal of Experimental Social Psychology, 38, 556-568.

          Quoting p. 559:
          “This ANOVA revealed a main effect of subliminal priming condition. Participants who received thirst-related primes drank significantly more liquid than participants who received the neutral primes (F(1, 73) = 4.05, p < 0.05). Although this analysis did not reveal an interaction between thirst condition and subliminal priming condition that reached a conventional level of significance (F(1, 73) = 2.06, p = 0.15), as can be seen in Fig. 1, simple main effect analyses revealed that the significant main effect was primarily due to the fact that participants who received the thirst-related primes drank more than participants who received the neutral primes when they were thirsty (F(1, 73) = 5.60, p < 0.05). In contrast, when participants were not thirsty, the subliminal priming condition had no effect (F(1, 73) < 1).”

          Why do people keep drawing conclusions about interactions based on nested comparisons? The reference to conventional levels of significance is well-taken; but the phrase is being used to mentally drag the p-value down to an accepted level of significance when it isn’t there. Either you buy into the NHST paradigm or you don’t. You can’t have an algorithm like (a) p < 0.05: significant; (b) p > 0.05: well, almost significant. This is just monkeying around with the result to make the point you’ve decided in advance you need support for.
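          That point can be illustrated with a quick simulation (Python, hypothetical sample sizes and effect size): even when the prime effect is identical in the two subgroups, i.e., when there is no interaction at all, “significant in one subgroup, not in the other” shows up routinely:

              # Simulate two subgroups with exactly the same true prime effect and count
              # how often exactly one of the two subgroup tests comes out significant.
              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(1)
              n_per_cell, true_effect, n_sims = 20, 0.5, 5000
              pattern_count = 0

              for _ in range(n_sims):
                  p_values = []
                  for _subgroup in range(2):                 # e.g., thirsty / not thirsty, same true effect
                      primed = rng.normal(true_effect, 1, n_per_cell)
                      neutral = rng.normal(0.0, 1, n_per_cell)
                      p_values.append(stats.ttest_ind(primed, neutral).pvalue)
                  if sum(p < 0.05 for p in p_values) == 1:   # significant in exactly one subgroup
                      pattern_count += 1

              print(f"'significant in one subgroup only' in {pattern_count / n_sims:.0%} of simulations")
              # "Significant" vs. "not significant" across subgroups is not itself evidence
              # of an interaction; the interaction has to be tested directly.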

          Even if your overall criticism is right about Andrew just superficially criticizing something he doesn’t understand, and I think there may be some truth in what you say since you are an expert in the area and have a broader understanding of the literature than Andrew, I would tread lightly when it comes to talking about statistical inference, given the absurdly low standards of statistical understanding that research on priming—as exemplified by the above quote—displays. Maybe you are willing to acknowledge that many researchers in this field don’t really understand what they are doing when they do statistical inference using p-values? How many bogus conclusions like the one quoted above are floating around in the literature on priming?

        • “I have been arguing for three major points: 1) there is a difference between a priori prediction and post hoc data analysis; 2) knowing the literature can allow people to judge at least to some extent whether a given analysis is a priori or post hoc; 3) to fail to do your homework on knowing the literature in evaluating an analysis, thus not fully considering whether it is a priori or post hoc, is lazy scholarship.”

          From Cesario et al. (2006), p. 908:

          “It should also be noted that there are possible alternative accounts for the findings of any one of our studies, some of which can be applied to prior automaticity work as well.”

          This may be a stupid question, but if there are alternative accounts for the findings available in the literature, how exactly can one judge whether a given analysis was done a priori or post hoc?

          (You could just look at your results, and then pick the theory which best accounts for your findings and use it to build your case for your proposed hypothesis in your writing. It may seem a priori then, but it really is post-hoc)

        • Steve:

          The simple fact is that (a) when outsiders tried to replicate the Bargh et al. study with a larger sample, the replication failed to produce the anticipated effect; (b) the statistical theory of researcher degrees of freedom explains how Bargh et al. could easily obtain statistical significance even in the absence of any such effect; and (c), despite (a) and (b), Bargh refused to admit the possibility that his original finding was spurious.

          You point out that a later study achieved statistical significance with a new interaction, not considered by Bargh, and yet another later study achieved statistical significance with yet a new interaction, not included by Bargh or by that earlier study. You can feel free to believe these patterns are real, just as Amy Cuddy can feel free to believe that she found real effects, despite having done a small-sample noisy study in which, from a statistical perspective, there’s just about no chance of finding anything useful, just as Daryl Bem can feel free to believe that the particular interactions he found to be statistically significant in his ESP study (but were not considered in other papers that he considered to be replications) were real, just as Satoshi Kanazawa can feel free to believe etc etc etc. All these people have theory. Theory’s fine. The data all of you are presenting though, not so impressive.

          I’m glad you’re trying to have an open mind, and I’d suggest that you consider the possibility that the claims of many published papers in these journals are spurious. Cos that’s consistent with (a) statistical theory and (b) the continuing failure of replications.

          It’s not a message that you might want to hear, and I’m sorry, but I’m just the bearer of bad tidings here. The real problem is coming not from Simonsohn, Francis, Wagenmakers, Button, me, etc etc., but from the people who sold you and your field on these methods, the people who sold you the message that enduring truths could be learned by studying small, highly variable effects with N=30 experiments.

          Let me put it one more way, and I hope this helps. Just suppose that none of the studies in question had statistically significant results. Suppose the published p-values in the studies we were discussing were not .03, .04, .02, etc., but rather were .3, .4, .2, etc. Would you still believe these effects? I think you should. The theory would still be as strong as it ever was, and the data would still be consistent with no effect.

          I have no problem with theories such as “activation is moderated by individual differences and motivational states”; my point is that such theories are, quite properly, quite general and can predict many different possible quantitative outcomes (for example, this particular theory can predict that slow-walking-from-elderly-priming would occur more among young people who like the elderly, or less among that group, or even that one or the other of these groups would walk faster when primed with elderly-related words). All these potential patterns are possible consequences of the theory, and indeed I believe that all are true—in different circumstances. But I don’t think the data from these small-N, highly variable studies are telling us anything. That’s the news that I, as a statistician, can convey to you. You can feel free to resist and you and others can continue to do Cuddy-style small-N studies of highly variable effects, but you’re fighting mathematics here. You’re trying to do the statistical equivalent of faster-than-light travel or a perpetual motion machine. I might not be expressing my views in the most polite way, but believe me, I am sincere, and I really hate to see people wasting their time trying to do what can’t be done.

        • Until the field of social psychology has done the sifting it is probably best to treat everything in the field with extreme skepticism though. If you tell me that some non-trivial proportion of apples in your basket are poisoned I will decline your offers to taste the apples even if many of the apples are perfectly safe and lovely.

        • I think this discussion has reached a very interesting point. I, too, am tempted to dismiss an entire field of research, particularly if it is not my own. The essential point of argument appears to be the value (or lack thereof) of doing small studies where there are many many moderating effects that have not been included (and perhaps which cannot be measured), and further, whether it only makes sense to do such studies if they are preregistered (or use some other method to ensure they are a priori based on a theory to be tested).

          This problem is not unique to social psychology. It happens in medicine all the time. It also happens in macroeconomics (a field I am more familiar with). For numerous reasons, small scale studies are the norm in these areas – due to costs (not just monetary). And, for numerous reasons, studies will continue to be done despite the weak underpinnings – because the results are just too important. We need to do small scale medical studies, and if they are promising, they are followed by larger scale more definitive work. Yes, they will be over-hyped and misinterpreted, but nobody is suggesting that we do away entirely with small scale studies.

          In macroeconomics, people study things like the impact of government debt on economic growth. The sample sizes are small – after all, the subjects are entire economies and often the treatment effects deal with discrete relatively rare events (such as financial crises). The number of moderating influences, which cannot easily be measured, are large – cultural differences between countries, historical differences, demographic differences, etc. I believe such studies will continue – and should, given the importance of the subject. I myself do not put much stock in any of the findings, given the small samples and number of potential moderating influences.

          So, is social psychology any different – in some way that we can say that such studies are not worth doing? I try to think about what we might learn from these studies. In medicine we might learn that a new drug is potentially helpful – or hurtful. That is worth knowing. In macroeconomics, we might learn that certain government policies enhance or inhibit economic growth. This is certainly worth knowing. In both cases, the small sample sizes and unincluded moderating factors mean that any study is just a first step whose results should not be overplayed. And, in both fields, the ugly head of academic politics means that too many weak studies will be published, too many errors will go uncorrected, too many replications will not be carried out or published, and too many results will be over-hyped by the media. We should try to address these negative factors, but I don’t think anybody is suggesting we stop doing the studies.

          I’m not so sure about social psychology – use priming as an example. I do think the results of such studies are worth something, even monetarily. Imagine that we can prime a presidential debate audience to respond more favorably to a candidate – that is certainly “worth” something to some people (although it may not be in the public interest). But I am not sure we can hope to generalize the results at all. If we can prime college students to walk more slowly, even if we were confident in the results, does that translate to being able to prime a debate audience? It is not clear to me that any of these studies are translatable. Power poses may turn out to be a real thing – I can easily believe that – but does their demonstration in a particular experimental setting translate to their applicability in any other setting?

          I feel as though it is my lack of belief in that field that has led me not to work in it. Similarly, I do not work on macroeconomic issues (I even refuse to teach the subject). I suppose that social psychologists have more faith in their ability to find important things that can be translated into other settings. Researchers probably self-select into fields in which they feel this way – and avoid working in fields they feel are not capable of being both important and producing replicable, reliable findings. Perhaps we are just seeing that self-selection play out in terms of the comments being offered in these posts.

          The one thing I think should not be lost is the importance of improving academic incentives and work habits. If anything, it underscores the need to mitigate the influence of publish or perish incentives. It points to the need for more cross-disciplinary team based work (which is unfortunately discounted in academic hiring and promotion decisions). And it points to the need to buttress our defenses against the growing pressure to find “significant” effects that are clear and unambiguous as technology continues to compress time and attention.

        • Dale, I agree with much of what you write but I think that social psychology is fundamentally different because the field’s small sample sizes are typically not driven by cost considerations (the cost of a larger sample size are often trivial) but by the rampant practice of p-hacking. I remember sitting in a faculty meeting back in 2011 when the Simmons, Nelson, and Simonsohn paper got published and was discussing its depressing implications with a friend. A very big name in social psychology overheard us discussing the p-hacking phenomenon and asked us to explain why the practice was so problematic because that was how he had been taught to do research (by an even bigger name) and how he approached all of his work. I am not sure if we were ever able to persuade him to change his practices but an examination of p-curves and the very low sample sizes in many areas of social psychology leaves me with the distinct impression that these findings not only do no good but cause active harm by wasting grant money, faculty positions, and journal space and leading intelligent and motivated graduate students into pursuing research into these areas.

        • Shravan,

          First let me say that yes, I am currently at the University of Waterloo, and yes, I was second author on that paper that used priming; for full disclosure, I used priming in another paper with Steve Fein and our colleagues around the same time. Those are my only two papers in which I have used priming, and I haven’t done a priming study for over 15 years. I have a passing interest in priming, but it is far, far from my main interest, and in those two papers I just used it as a tool for examining what I was really interested in–models of persuasion in the Strahan, Spencer, and Zanna paper and stereotype activation in the paper with Steve Fein and our students. Your post seems to imply I am a priming insider. I most certainly am not. It seems to imply that my taking up this topic is a self-interested defense of the findings from one of my earlier papers. Let me assure you that is not the case. I know full well that the null hypothesis testing in that paper, and really all of my previous papers, would not hold up to the standards that are being set forth here.

          What I find strange and patently unfair is that you would question my understanding of statistics based on a paper written 15 years ago. Have you learned anything about statistics in the last 15 years? I would hope so; I know I have. I also commented earlier that I have learned from recent critiques and I have changed the practices in my lab. None of that seems to matter, however; instead, everything I say will be dismissed because of a paper I wrote shortly after graduate school. I hope you can see how this standard, too, argues for dismissing not only all the work that was done previously, but seems to permanently taint anyone who was involved in this work as lacking in understanding of statistical issues. You ask why people keep drawing conclusions about interactions based on nested comparisons. I don’t. I did when I was a young scholar, as that was the practice of the day. To be fully honest, I didn’t buy into the NHST paradigm then and I certainly don’t now, but that was the paradigm that journals demanded.

          Now I hope we can get beyond the mud slinging and I am deemed worthy of participating even though I published articles in the past under a different paradigm and I have learned better ways of doing analyses. With that in mind let me say why I am participating in the discussion. I think the big issue at this point is what we do with all the past research in Social Psychology. Do we dismiss it all? That seems to be the implication of Andrew’s blog even though he doesn’t fully seem to realize it. I think it is important that we don’t do that and instead we carefully evaluate past research noting not only its weaknesses but also its strengths.

          We have faced similar issues in the past and we have done better. As an example, many of you may know of the classic studies by Lewin on democratic vs. autocratic vs. laissez-faire leadership style. The actual analyses of these studies were fraught with difficulties, including big issues of non-independence of the observations. They simply have no hope of standing up to a modern, or even 1970s, evaluation of the data. Is it still an important study? I think so. Should Lewin be treated with respect (and I would argue even awe in Lewin’s case) despite the weaknesses of his data in this and all his studies? I would argue yes, we should.

          So, it is my hope that as people evaluate past research they don’t pull out the sledgehammer and, as soon as they find any practice that is not acceptable by today’s standards, dismiss it and the researchers who conducted it. Instead, I hope they will be thoughtful and careful evaluators of the evidence, consider strengths and weaknesses (not just weaknesses), and consider those who have conducted these studies worthy of a discussion.

        • Steve:

          It’s hardly “mud slinging” to point out errors in a 15-year-old published paper. Assuming the paper has not been retracted, it’s still part of the scientific literature.

        • @Steve

          I’ve one, perhaps naive, question:

          Can you elaborate on how, given a past study, you are going to distinguish between the thoughtful people who were testing their a priori hypotheses and the whack-a-mole data-churners and fishing fans?

        • Andrew, I can agree that pointing out a 15-year-old error is not mudslinging, but perhaps you can agree that pointing out that someone reported a statistical analysis in the typical way that such analyses were reported 15 years ago does not really warrant questioning the person’s knowledge of statistics. It was that questioning of my knowledge of statistics that I found a condescending mud sling. It most definitely was personal and it carried the implication that what I had to say wasn’t worth listening to.

          That is fine with me; if the people here at this blog do not feel that I am worthy to be part of the discussion, I can live with that. I do think I raise a significant issue, however. It has become clear to me that you, Andrew, believe that all of the field of Social Psychology, or nearly all of it, should be dismissed. Am I wrong about that? In contrast, I think there is much that is valuable in past research, even if it is hidden in the noise of small studies much of the time. I do expect we will know in time. More powerful studies will become the norm in social psychology, and we will have less noisy signals.

          The trap you seem to have fallen into, Andrew, is that because researcher degrees of freedom may be the cause of almost any effect in social psychology, you seem to assume that they were indeed the cause of that effect. When multiple potential causes for an effect are possible, you privilege researcher degrees of freedom and seem to take it as definitive that they are indeed the cause of the effect.

          Take, for example, the Bargh, Chen, and Burrows walking-slowly-after-elderly-prime effect and the Pashler “failed” replication. When you examine this pair of studies you seem convinced that the Bargh results were caused by experimenter degrees of freedom and the Pashler results are accurate. In contrast, I see at least three compelling explanations and I am not at all sure which is correct. I think they are all plausible. One explanation is the one you seem to prefer, Andrew. A second possible explanation, however, is that the true effect size of the priming manipulation in these studies is small to moderate and that both results are within the confidence interval for this small-to-moderate underlying effect size. Correct me if I am wrong, but do not the confidence intervals for the effect size of the manipulation in these two studies overlap? Isn’t it possible that the underlying effect size lies in this overlap and that both results are within the kind of variation we would expect by chance? Finally, there is a potentially important methodological change in the Pashler study from the Bargh study. Pashler had the RA rush up behind the participants as they were walking during the DV collection, whereas Bargh simply measured the time it took to walk the distance without any other people around. I understand that Pashler probably had to institute this variation to meet ethical requirements that the participants be fully debriefed: as they were measured walking to the exit door, if the RA had waited until they were at the exit, then they likely would have “lost” some participants and would not have been able to debrief them. Still, in my view, it is possible that the RA rushing up behind participants may have significantly disrupted the speed at which people would have naturally walked and substantially interfered with the collection of the walking-speed DV. I think all three of these explanations are possible and I don’t think it makes sense to jump to any one conclusion at this point. More testing would be needed.
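          The confidence-interval question in the second explanation can be checked directly; the sketch below (Python, with hypothetical estimates and standard errors, since the real numbers would have to come from the two papers) shows the computation, along with the caveat that overlapping intervals are a weaker criterion than testing the difference between the two estimates:

              # Hypothetical original-study and replication effect estimates (seconds) and SEs.
              import numpy as np

              est_a, se_a = 1.0, 0.5     # placeholder for the original study
              est_b, se_b = 0.1, 0.4     # placeholder for the replication

              def ci95(est, se):
                  return est - 1.96 * se, est + 1.96 * se

              ci_a, ci_b = ci95(est_a, se_a), ci95(est_b, se_b)
              print("CI A:", ci_a, "CI B:", ci_b)
              print("overlap:", ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1])

              # The more direct check: is the difference between the two estimates
              # large relative to its standard error?
              diff = est_a - est_b
              se_diff = np.sqrt(se_a**2 + se_b**2)
              print("difference:", diff, "+/-", round(1.96 * se_diff, 2))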

          Similarly, I find the recent replication of Cuddy’s power pose research leaving the status of that effect much more equivocal than people seem to be taking it. Personally, I was very skeptical of this research and generally of almost all embodied cognition research. I would not have been surprised at all to see a failed replication. When I look at the data, however, I see some clear replication and some clear failure of replication. Let me explain. The theoretical model that Cuddy has advanced is that adopting a power pose leads to feelings of being powerful, which in turn lead to lots of positive things, including increased testosterone and risky decision making (why these are supposed to be good things still eludes me). Now, the recent large-n replication finds strong evidence that adopting a power pose leads to feelings of being powerful. I was shocked by this finding, as I had always guessed that adopting a power pose would lead to feelings of awkwardness–I still assume I would just feel like a dork. But this effect is quite solid in the replication, and the confidence interval for the effect size, although still pretty big even in this larger-n study, is nowhere close to zero and suggests the effect should be at least moderate in size. In contrast, the effect of the power pose on testosterone and risky decision making is incredibly weak, and I assume, although it wasn’t reported, that the effect of feeling powerful on testosterone and risky decision making is also weak. Now what do we make of this replication? We can say that there is nothing to adopting a power pose–my personal a priori belief–or we can say that there seems to be decently strong evidence that adopting a power pose leads to feeling powerful, but the evidence that it affects testosterone and risky decision making is not strong at all. It is possible that the original study obtained effects on testosterone and risky decision making by researcher degrees of freedom, but the simplest explanation for the effect of the power pose on feeling powerful in that study is that the power pose creates these types of feelings. The replication even ruled out demand as an explanation of that effect, which could not be ruled out in the original study. From my perspective the effect of the power pose on testosterone and risky decision making isn’t all that interesting. It suggests to me that power pose researchers would be well served to look at other DVs–frankly, I suspect that testosterone and risky decision making were chosen as DVs because Cuddy works at a business school and they care about such DVs. It may be that feeling powerful after a power pose is strictly an epiphenomenon, in much the way behaviorists thought all thoughts were, but it might be the case that feeling powerful has important consequences. I think both possibilities are plausible, but I don’t think this replication suggests that the power pose is nearly as trivial as I would have guessed based on my own biases and intuitions.

          It is easy to interpret such replication efforts as definitive evidence that the effects in question aren’t real and are the result of researcher degrees of freedom. I would submit, however, that this is a biased consideration of the evidence and fails to consider alternative accounts that might be more positive. I have no idea whether either the Cuddy effect or the Bargh walking-slowly effect is actually real, artifactual, or the result of researcher degrees of freedom, but that is my point: I have no idea. Andrew, you seem to be convinced it is researcher degrees of freedom despite other possible (and I would argue plausible) alternative explanations. That to me is the problem with this current blog post. A rush to judgment about both this line of research and the field as a whole. I hope you can hear these concerns without dismissing me, but if not, that is fine with me too. I have said my piece.

        • Steve: You write that I “seem convinced that the Bargh results are caused by experimenter degrees of freedom.” Not quite. I actually don’t think Bargh has any “results” that need to be explained!

          Suppose someone flips a coin 12 times and gets 8 heads. Whassup, some trick coin??? No, there’s no explanation needed. When you flip a coin 12 times, you can get 8 heads, it’s no big deal, that’s the kind of thing that happens when you flip coins.
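          For anyone who wants to check the arithmetic behind the coin example, a quick computation (Python, using scipy’s binomial distribution):

              # Probability of 8 or more heads in 12 flips of a fair coin.
              from scipy import stats

              p_8_or_more = stats.binom.sf(7, 12, 0.5)   # P(X >= 8)
              p_as_extreme = 2 * p_8_or_more             # 8+ heads or 8+ tails
              print(round(p_8_or_more, 3), round(p_as_extreme, 3))   # roughly 0.19 and 0.39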

          Similarly, when you gather data on 30 people and start to look for patterns, you find things. No big deal, nothing that needs explanation.

          That’s why I wrote in my earlier comment that you should simply imagine that the p-values in these studies were .4, .2, .3, etc., instead of .04, .02, .03, etc. If p-values such as .4 occurred, the effects could still be real. There could still be social priming! It’s just that the data are telling you essentially nothing useful about the sign or the magnitude of any effects, let alone how they might interact with person- or situation-level features.

          That’s what’s going on here. These experiments are simply too small and noisy to be useful. That was the point of my paper with Loken, and my paper with Carlin, and my paper with Weakliem, also the recent paper by Katherine Button et al., memorably called “power failure.”
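          As a rough quantification of that point, here is a short simulation (Python, with hypothetical but typical numbers: 30 per group and a small true effect): power is low, and the estimates that do reach p < .05 greatly overstate the true effect, the kind of exaggeration discussed in the paper with Carlin mentioned above.

              # Simulate a two-group study with n = 30 per group and a small true effect,
              # then look at power and at the size of the statistically significant estimates.
              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(2)
              n, d_true, n_sims = 30, 0.2, 10000
              estimates, significant = [], []

              for _ in range(n_sims):
                  treated = rng.normal(d_true, 1, n)
                  control = rng.normal(0.0, 1, n)
                  estimates.append(treated.mean() - control.mean())
                  significant.append(stats.ttest_ind(treated, control).pvalue < 0.05)

              estimates, significant = np.array(estimates), np.array(significant)
              print(f"power: {significant.mean():.2f}")
              print(f"mean |estimate| when significant: {np.abs(estimates[significant]).mean():.2f} "
                    f"(true effect = {d_true})")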

          I have no doubt that priming is real, that it does affect our lives, and that it varies by person and by situation. But the way to study it is not to create situations where the effects will be tiny (for example by flashing the prime on a screen for less time than can be consciously perceived, or by burying the prime in a fake experiment) and then doing measurements on 30 people.

          Do I think that “all of the field of Social Psychology, or nearly all of it should be dismissed”? I have no idea. I know little of social psychology. I do know that leading journals have published a lot of social psychology papers based on experiments that contain essentially no useful information, which are then written up in the form of bold claims that are unsupported by the data. So that’s too bad, and I’m with Nosek and the others who’d like to reform these practices. But I just don’t know enough about social psychology to make general claims: I can well imagine that there’s a lot of work in that field that doesn’t have these problems.

        • Andrew, if only you knew the history of the field you would understand why very subtle and short-duration primes were used in priming research. In the late 1960s and early 1970s there was another crisis in the field, and several people posed demand characteristics as an alternative explanation for many of the effects in the field. Conscious priming studies are a poster child for such demand-based alternative explanations, which made long, conscious priming studies essentially unpublishable. People turned to very subtle primes in an effort to render demand characteristics implausible.

          What I see happening now is that people like you are posing a new alternative explanation for the results in Social Psychology: researcher degrees of freedom, which makes chance a compelling alternative explanation. I think you have overstated your case–flipping a coin is totally determined by chance, whereas psychological results, even if they are nowhere near significant, are still a result of noise and a signal (unless the underlying effect is truly zero). But there is no doubt that you and others, including Brian Nosek (and, ironically, Joe Cesario, who is the co-founding editor of a new journal that only publishes preregistered and high-powered studies that have been reviewed before they are conducted and publishes the studies regardless of the effects obtained), have moved the field to do more to make chance a less plausible alternative explanation of the results obtained in the field.

          Looking forward, hopefully we can agree that is a good development. Looking back at previous research, I will continue to argue for a more carefully nuanced interpretation of all the possible explanations for the reported results in previously published manuscripts. Nothing that has been said here has changed my mind about that at all, and in fact these discussions have raised serious concerns that I have heard others raise but I thought unlikely–that some want to rush to judgment to characterize past findings as the result of researcher degrees of freedom, with a bias toward only considering factors that call the findings into question without considering evidence that makes researcher degrees of freedom less likely. I think we do the field a disservice if we quickly sweep away past research without careful thought and a full understanding of the theories that were being tested.

          To answer Q’s question, past theorizing can often tell whether a given prediction was a priori or post hoc. For example, based on previous research, if a paper with Higgins on it had predicted a main effect for priming, it would have been obviously post hoc. His theorizing had emphasized for years that such effects were moderated. Not everybody has previous theorizing that makes it clear whether an argument is a priori or post hoc, but sometimes that is the case, as it was in the studies discussed in this blog, and when it is, good scholarship will seek to understand the theories well enough to make this judgment. Other times the research is identified as part of a doctoral dissertation and was therefore publicly proposed before it was conducted. Here we can be pretty sure about a priori prediction, there is a public record of the dissertation that can be checked, and committee members can even be questioned if one is skeptical enough. More commonly now, people are preregistering research, as Pashler did in the studies described above, and for those too we can know what is a priori and what is post hoc. So yes, this can be known at times.

        • Seems I reached maximum comment depth and couldn’t reply to Steve’s response below.

          I agree Steve that it’s unfair to bring up an old paper like this; I’m sure you understand statistics. I suspected you don’t because you don’t seem to see the p-value hacking issues in the papers under discussion, but I agree I don’t really have enough information about you to judge. I do apologize for the insulting implication of my comment.

          The thing that bothers me is the vigorous defence people put up of all this p-value hacking. I wouldn’t mind it so much if people (not referring to you, but something I see in general) said, look, I gotta eat, I need a job, just let me do this, OK? One easy test (I have done this in my own research) is to just do an experiment and literally do it again; the main thing that changes is that you have new subjects. It’s only rarely that I can get the same pattern of results to turn up. This is not priming research, but the results are usually not stable. I’m an expert in my field and I have a whole lot of belief and knowledge about the problem I’m studying, but I can’t get the darn effect to replicate like theory says, not even my favorite theory’s effects. I am no expert on priming, but looking at the extreme certainty with which Bargh proceeds (just listen to the YouTube video someone posted in the comments), and given my experience of running and failing to replicate results, he’s either fooling himself or worse.

          In my opinion, your criticism of Andrew not having the domain expertise to comment on this topic is probably valid. I often feel that I would rather talk to a domain expert than a statistician when I have a question, because the statistician usually has no idea what can happen and doesn’t have the depth of knowledge that is needed to reason about a problem. They suggest all kinds of things that make no sense at all.

          What I don’t get is why you are so sure that there is no p-value hacking going on in areas like priming research. It is probably not deliberate cheating or anything like that; it’s just that it is so easy to fool oneself. I can see that you are really sure that there is no monkey business going on, but I just don’t understand why. How could your experience be different from others’? Whatever Andrew’s state of ignorance about priming, he and others seem to be right about this problem of p-value hacking.

        • Shravan,

          I appreciate the apology, but I don’t think you understand my position. I am not at all sure that there is no p-hacking involved in any of this research. I would not be surprised if there were things like stopping to check the significance of the finding and, if it is not significant, running 10 more participants per cell. I know I have done such things in my own research and didn’t understand the consequences of doing so at the time, but now I do and I have stopped that sort of practice. What I am quite confident of, however, is that a certain type of p-hacking, what Andrew seems to be calling whack-a-mole, didn’t happen, and that is when people predict a main effect, don’t get it, then fish for a moderator, and when they find a moderator they report the significant interaction instead of the predicted main effect. I am quite sure that didn’t happen in the Cesario, Plaks, and Higgins paper, because based on years of theorizing before this paper I am quite sure that Higgins would have predicted the moderator from the start. In that paper I am quite sure that the moderator prediction was a priori and not a post hoc whack-a-mole strategy. Andrew in his blog clearly proposed it was a post hoc analysis. I argue that with a reasonable look at the literature anyone could easily tell it was an a priori strategy and that Andrew’s claim that it was a post hoc whack-a-mole strategy is false. That is the point about which I am sure. It is also a point that refutes the central claim in Andrew’s blog, a claim that forms the basis for the cute title. Simply put, I am sure that the Cesario et al. paper does not take a whack-a-mole approach as Andrew claims, and further I believe Andrew could have known that with more careful and thorough scholarship.

          Now, am I sure that there was no p-hacking in the Cesario et al. paper? No, I am not. Am I sure there was no p-hacking in the Bargh, Chen, and Burrows paper? No, I am not. Andrew here, however, is proposing that these results (and especially the Cesario et al. paper) were subject to a particular form of p-hacking–whack-a-mole–in which people test a bunch of moderators until they find some sort of interaction and then treat that interaction as if the main effect were significant. I am quite sure that neither of these papers involved that type of p-hacking, and I am quite sure about that because of the prior theorizing of these authors. So, could there have been p-hacking in these studies? Perhaps there was; I can’t rule that out. Could there have been a whack-a-mole approach? No, I am quite sure there was not.

        • Steve:

          No, you misunderstand. The whack-a-mole I was discussing is happening between papers. It may also be happening within papers, but that’s not what I was talking about here. In particular, I was referring to the quote by Bargh:

          There are already at least two successful replications of that particular study by other, independent labs, published in a mainstream social psychology journal. . . . Both appeared in the Journal of Personality and Social Psychology, the top and most rigorously reviewed journal in the field. [JPSP also published Bem’s notorious ESP paper — ed.] Both articles found the effect but with moderation by a second factor: Hull et al. 2002 showed the effect mainly for individuals high in self consciousness, and Cesario et al. 2006 showed the effect mainly for individuals who like (versus dislike) the elderly.

          Three studies, three comparisons. The first study by Bargh et al. found a main effect. The second study by Hull et al. found a particular interaction (for individuals high in self consciousness). I have zero doubt that had the second study found a main effect, Bargh would’ve counted it as a replication. The third study found a different interaction (for individuals who like the elderly). I have zero doubt that had the third study found a main effect, or had it found the interaction found in the second study, Bargh would’ve counted it as a replication. Meanwhile, it makes me wonder why, if these interactions are so important, Bargh was so sure that his first study, which did not look at those interactions, was correct.

          Any individual study has a story. Put it together and you have whack-a-mole.

          Again, I urge you to reflect upon how you would be thinking about these studies were the p-values reported as .4, .2, .3, etc, rather than .04, .02, .03, etc. In your attempts at understanding the field, you’re implicitly putting a huge burden on these p-values. That’s the ironic thing: you don’t like the message that this statistician is telling you—but it’s statistics that’s led you into this misunderstanding. It’s just like Tversky and Kahneman wrote about in the 1970s: research psychologists (and ordinary people) expect an unrealistic level of stability in small samples.

        • OK, Andrew, I see your point about criticizing Bargh for calling the interactions in Hull et al. and Cesario et al. a replication of his main-effect finding. They were not replications, but I haven’t tried to defend Bargh’s statement. What I have tried to challenge is your rush to judgment about the follow-up studies.

          In this latest post you say the whack-a-mole charge was across studies, but you said earlier in this thread,

          “Regarding Cesario et al., let me repeat that the general theory (varying effects) was a priori, but the particular analysis was not (see Erikson’s comment elsewhere in this thread). And, again, Cesario et al. in their paper never claimed to have made their data processing, analysis, and presentation decisions ahead of time. Their analysis was contingent on the data.”

          This is clearly a charge of whack-a-mole within the Cesario paper. I have argued that if you knew the literature you would know that their analysis was not contingent on the data, but was rather an a priori prediction. You were wrong in your charge, and you made a rush to judgment about the nature of their analyses, calling them post hoc when, if you know the theory, they were clearly a priori. This is lazy scholarship in my view, because knowing their theory just means doing the appropriate homework. I challenged Erikson’s point because he or she too was jumping to the assumption that the analysis was post hoc. I don’t think it is groundbreaking to argue that knowing a theory is important to knowing whether a test is a priori or post hoc. You didn’t know the theory well enough and you came to the wrong conclusion.

          You made a similar error in your recent critique of Amy Cuddy’s power-pose research in Slate. I highlighted that error above, despite my own misgivings about this research. The theory is really quite simple: a power pose leads to feeling powerful, which leads to positive outcomes (which, for reasons I don’t fully understand, Cuddy operationalized as increases in testosterone and risky decision making in the paper in question). Now, the much larger-n replication you review actually found solid evidence for the first part of this theory. The power pose led to robust feelings of power, and the replication even ruled out demand as an alternative explanation. I was as surprised by this finding as anyone, because my own intuition is that the power pose would just make me feel like a dork. Somehow you seem to totally miss the importance of this finding, however. It is an important replication of the first part (and arguably the most important part) of Cuddy’s model. Now maybe feeling powerful never leads to any positive outcomes and is totally an epiphenomenon, and the power pose certainly does not seem to replicate having an effect on testosterone or risky decision making. It remains plausible, however, that the power pose could lead to feeling powerful, which could lead to other positive outcomes. You totally miss that possibility in your review. Instead you focus only on what failed to replicate and totally miss what did replicate. Again it seems you rushed to a negative judgment about a line of research without fully understanding the theory, and in doing so missed important aspects of the data and what those data suggest about the theory.

          It is my hope that as you continue to look at data in this field (and any other field) you take the time to learn the theories that are being tested, because understanding the theories is often necessary to understand what is actually being tested and the nature of the tests being conducted (e.g., whether they are a priori or post hoc).

          Finally, I think you haven’t taken the time to really understand what I have been writing. You say I have placed a tremendous burden on p-values. I don’t think you will find that in anything that I have written. In fact, if you look at the arguments above, the only claim that I make for an effect is the effect of the power pose on feeling powerful, and this was in the replication study with a large n. I think I understand pretty well the importance of large samples. This criticism of me, however, seems to be yet another example of the pattern in your writing on social psychology. You don’t bother to take the time to understand what people are actually saying, but nonetheless rush to a judgment that characterizes them unfairly and negatively. I urge you to take the time to understand what the people you are criticizing are actually saying before you develop your critique.

        • Steve:

          It seems a bit odd for you to believe that Cesario et al. would’ve made the same choices in data processing, analysis, and summary had their data been different, considering that they themselves never made that claim in their article, and considering that making all those decisions ahead of time was not something people were doing in that field, and considering the grab-bag of analyses actually presented in the paper.

          This is fine—that’s how I do applied research too. I am interested in certain substantive theories, and I modify the particular expression of such theories in light of the data. On the very rare occasions when I do a preregistered analysis, I make it clear that I have done so.

          Anyway, I repeat that you should simply imagine that the p-values in these studies were .4, .2, .3, etc., instead of .04, .02, .03, etc. If p-values such as .4 occurred, the effects could still be real. There could still be social priming! It’s just that the data are telling you essentially nothing useful about the sign or the magnitude of any effects, let alone how they might interact with person- or situation-level features. The theory may be great but the data are telling you a lot less than you think. Which is why different studies find different things: the theory is rich enough that it can predict a main effect as found by Bargh et al., interaction A as found by study A, interaction B as found by study B, etc.
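
          To make the sign-and-magnitude point concrete, here is a rough simulation sketch; the true effect (0.1 standard deviations) and the per-group sample size (20) are assumptions chosen only for illustration:

            # Rough sketch: when a small, noisy study does reach p < .05, the estimate
            # tends to exaggerate the true effect and can even have the wrong sign.
            # True effect (0.1 sd) and per-group n (20) are illustrative assumptions.
            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(2)
            true_effect, n, sims = 0.1, 20, 20000

            exaggeration, wrong_sign = [], 0
            for _ in range(sims):
                a = rng.normal(0.0, 1.0, n)
                b = rng.normal(true_effect, 1.0, n)
                est = b.mean() - a.mean()
                if stats.ttest_ind(b, a).pvalue < 0.05:      # keep only "significant" results
                    exaggeration.append(abs(est) / true_effect)
                    wrong_sign += int(est < 0)

            print("mean exaggeration factor:", round(float(np.mean(exaggeration)), 1))
            print("share of significant results with the wrong sign:",
                  round(wrong_sign / len(exaggeration), 2))

          Under these assumptions, the statistically significant estimates overstate the true effect several-fold and a non-trivial share of them point in the wrong direction: a concrete version of the point about sign and magnitude.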

          Again, I thank you for sharing your perspective and giving me the opportunity to clarify my thinking on these issues.

        • Spencer,

          I replied to your comment above about the analysis of Cesario et al., but given that there are already a lot of threads running loose, maybe you missed it. I tried to work out how well the statistical tests supported the non-statistical predictions.

          In summary: I don’t claim they have p-hacked, in the sense of intentional fraud, nor that the hypotheses are post hoc. What I claim is that their three studies have given very weak evidence that they readily interpreted as enough evidence for their theory.

          They haven’t p-hacked, in my opinion, but they have abused accepted practices: rounding down a p-value, changing from a two-tailed to a one-tailed test, making multiple comparisons without any sort of correction, disregarding non-significant results from omnibus tests, etc. I am well aware that those practices are common in psychology research, which may make them seem justified.

          Also, their hypotheses might not be post hoc, as you argue and I agree, but much of their statistical analysis is contingent on the data. This is also common in psychology: researchers very rarely ‘translate’ their qualitative hypotheses into statistical hypotheses. It’s a problem because a qualitative hypothesis can be mapped to many statistical hypotheses, even when the qualitative hypothesis is proposed a priori!

          This means that their choice of test can change ad libitum, as it did: in the first study, the between-group differences are tested via an omnibus ANOVA (non-significant) and planned contrasts of the overall model; in the second study, the omnibus ANCOVA (also not significant) is followed by another ANCOVA with a subset of the data; in the third study, the omnibus ANOVA (again not significant!) is followed by pairwise t-tests, reported as one-tailed tests.

          Did the data suggest the tested (qualitative) hypotheses? No: they proposed those hypotheses a priori, from their overall theory. But the analysis changed each time in light of the data, which implies that they changed their statistical hypotheses. Even with those changes, which should call for a p-value correction, the p-values are marginal at best. If we correct them for “allowance for selection”, the results are no evidence at all for their theory.
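
          Just to make concrete what an “allowance for selection” does to marginal p-values, here is a minimal sketch using a simple Bonferroni-style adjustment. Both the number of candidate analyses (five) and the p-values are made up for illustration; they are not counts or values taken from the paper:

            # Minimal sketch: a Bonferroni-style "allowance for selection" applied to
            # marginal p-values. The number of candidate analyses and the p-values
            # below are hypothetical, for illustration only.
            candidate_analyses = 5                    # e.g., omnibus test, contrasts, subset analyses (assumed)
            hypothetical_pvals = [0.03, 0.05, 0.06]   # invented marginal p-values

            for p in hypothetical_pvals:
                adjusted = min(1.0, p * candidate_analyses)
                print(f"reported p = {p:.3f} -> selection-adjusted p = {adjusted:.3f}")

          Any correction of this kind pushes p-values in the .03 to .06 range well past conventional thresholds, which is the sense in which marginal results stop being evidence once selection is taken into account.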

          I am eager for the first issue of Cesario’s new journal that will publish pre-registered studies. I really think that, given that many psychologists just won’t change how they conduct their statistical analysis, pre-registration will help us see more clearly what is noise in the field. But I also think that pre-registration would not be so necessary if the researchers were careful in defining clear statistical hypotheses, with carefully chosen test statistics proposed before seeing the data.

        • Andrew:

          I think what is clear from the theory and the paper itself is that Cesario et al. planned a priori tests as their primary analysis in each of the three studies. What is also clear from the theory is that they think primes prepare people to interact with the target of the prime and do not prepare people to enact the content of the prime. They have a different model of how priming affects behaviour than Bargh (at least as he wrote about it in his cognitive monster paper and tested that model in the Bargh, Chen, and Burrows study) and they contrast their predictions with his in each of the three studies. In Study 1, they are doing a replication of Bargh, Chen, and Burrows Study 3, but using a different prime. Bargh, Chen, and Burrows argued their study demonstrated that when you prime people with the African-American stereotype they display hostility because the content of the African-American stereotype includes hostility. Essentially, activating hostility creates hostile reactions. Cesario et al. argue that priming the African-American stereotype led to increased hostility because people react to outgroup members (all the subjects in the Bargh, Chen, and Burrows study were non-Black) with hostility. So for the original study the two theories make the same prediction. In Cesario et al. Study 1, however, they used a different prime, “gay”: a presumed outgroup for most of the participants, but one that does not have hostility as part of the stereotype. Gay men are not stereotyped as being hostile in the way that African-American men are. This particular theorizing is totally consistent with Higgins’s theorizing about activation over at least 20 years. So, Cesario et al. had three conditions: a “gay” prime, a “straight” prime, and no prime. Their theory clearly would predict that the participants in the gay-prime condition would display more hostility than participants in the other two conditions. They identified this prediction as a planned contrast and tested it as such. Now, I would quibble with whether they should have done two t-tests or just one (one would have been the appropriate way to go in my view), but the general point is that this was an a priori test that followed from theory, and they identified it as such and as the primary analysis for the study. This is not whack-a-mole. It is the other end of the continuum.
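
          For readers who want to see the mechanics, here is a bare-bones sketch of how a one-degree-of-freedom planned contrast of that form (one condition against the average of the other two) is typically computed. All the numbers below (group means, SDs, and sizes) are invented for illustration; they are not the values from Cesario et al.:

            # Bare-bones sketch of a planned contrast comparing one condition (weight +2)
            # against two others (weights -1, -1). All means, SDs, and group sizes are
            # hypothetical, for illustration only.
            import numpy as np
            from scipy import stats

            weights = np.array([2.0, -1.0, -1.0])   # target prime vs. the two comparison conditions
            means   = np.array([5.2, 4.6, 4.5])     # hypothetical group means (hostility rating)
            sds     = np.array([1.4, 1.3, 1.5])     # hypothetical group SDs
            ns      = np.array([20, 20, 20])        # hypothetical group sizes

            mse = np.sum((ns - 1) * sds**2) / np.sum(ns - 1)   # pooled within-group variance
            contrast = np.sum(weights * means)
            se = np.sqrt(mse * np.sum(weights**2 / ns))
            t = contrast / se
            df = np.sum(ns) - len(ns)
            p_two_sided = 2 * stats.t.sf(abs(t), df)

            print(f"t({df}) = {t:.2f}, two-sided p = {p_two_sided:.3f}")

          Specifying the weights before seeing the data is exactly what makes such a contrast “planned”; the disagreement in this thread is about how many other analyses were also available once the data were in.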

          I could trace this same theorizing and carefully planned predictions in Studies 2 and 3 as well. In each study the primary analysis that is theoretically derived is identified as a planned test and tested as such. Now, you could take issue with the fact that in Study 1 one of the planned contrasts was p = .053 (and for a planned test that means you should not reject the null hypothesis), and that in Study 3 the planned interaction test was p = .065 and they should not have rejected the null hypothesis. But the fact that they reported these clearly planned tests even though they did not reach significance argues, in my view, against a whack-a-mole strategy of analysis rather than for one. If they had wanted to adopt a whack-a-mole approach they could have easily done so and brought the findings to p < .05, but they didn't.

          There is other clear evidence against a whack-a-mole approach in this study. Cesario has publicly posted all the data and all the methods from this study. In addition, Cesario is one of the very few people in the field who publicly posts all the methods and data from his file drawer. Included in this file drawer are 4 failed replications of other Bargh priming studies (these involved priming spatial distance). So, that too suggests a scholar who does not play whack-a-mole, but instead makes it publicly known what he has found even if it isn't publishable. And on that issue of publishability, Cesario is also one of the founding editors of a new journal that reviews proposals for preregistered studies, and if those preregistered studies pass the review (done before data is collected), then they are published regardless of how the results turn out.

          So it may seem a bit odd to you that I believe "that Cesario et al. would’ve made the same choices in data processing, analysis, and summary had their data been different, considering that they themselves never made that claim in their article, and considering that making all those decisions ahead of time was not something people were doing in that field, and considering the grab-bag of analyses actually presented in the paper." But I think your statement here just makes it abundantly clear that you have not done your homework, either with this paper or with Cesario's practices in publishing. I think in the paper it is abundantly clear what the primary hypotheses were and that they were planned, and I think that, looking at Cesario's practices in both this paper and his other work, there is strong evidence that he has adopted a different approach than was typical in the field. For you to paint him with the brush of your perception of the field is again, in my view, lazy scholarship and a rush to judgment, even when there is ample evidence against that judgment.

        • Steve:

          1. For details on the choices of Cesario et al. which in no way could’ve all been chosen ahead of time, see Erikson’s comment in this thread.

          2. By saying that Cesario et al. did not fix their data processing, analysis, and presentation choices ahead of time, I’m not painting with a brush or rushing to judgment or whatever. I don’t fix my data processing, analysis, and presentation choices ahead of time either.

          3. I agree with you that the theory is interesting and that it has many observable implications. Unfortunately these sorts of studies are just too noisy to learn anything useful. Again, I can understand that you don’t want to hear this—but ultimately you’ll have to pin the blame on the mathematics, not me. It’s just what happens when you try to study highly variable effects in this way. This literature is an exercise in the chasing of noise.

  8. Thanks. The term “stable effect” is very helpful. It’s a polite but clear way to express caution and skepticism: “That’s an interesting result. I wonder if it will prove a stable effect in replication studies?” That puts the burden of proof where it belongs.

    We can proceed from, say, mechanical physics to mechanical engineering with some confidence because the effects found in physics tend to be quite stable. Similarly, the search for stable effects should apply to proceeding from social science to social engineering.

  9. Here’s another ‘beauty’:

    “Moving While Black: Intergroup Attitudes Influence Judgments of Speed”

    http://www.apa.org/news/press/releases/2015/11/racial-anxiety.aspx

    The hypothesis is absurd, the design flawed in every possible way, they drop 10% of observations for being outliers, create a truly arcane scoring metric, analyze ordinal Likert data as continuous, find R2 values around 0.04 to 0.10 (all marginally significant), and then say their hypothesis has been confirmed.

    “To control for potential effects of social arousal, a mean-centered version of this variable was also entered into the model as a predictor. The expected three-way interaction between intergroup anxiety, target race, and direction was statistically significant (B = 0.135, p= 0.035, R2 = 0.017).”

    The authors themselves are only partly to blame: the fact that no reviewer or editor flagged this ‘study’ really shows how deep the rot is in psychology. Expected three-way interaction? Are you kidding me??
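
    For a sense of scale, here is a quick sketch of what an R2 of 0.017 amounts to, and how noisy such an estimate is. The per-study sample size (n = 150) is an assumption for illustration; it is not taken from the paper:

      # Quick sketch: how weak and how noisy an R^2 of about 0.017 is.
      # The sample size (n = 150 per simulated study) is an assumed figure.
      import numpy as np

      rng = np.random.default_rng(3)
      r = np.sqrt(0.017)                 # implied correlation of roughly 0.13
      n, sims = 150, 10000

      sample_r2 = []
      for _ in range(sims):
          x = rng.normal(size=n)
          y = r * x + np.sqrt(1 - r**2) * rng.normal(size=n)
          sample_r2.append(np.corrcoef(x, y)[0, 1] ** 2)

      lo, hi = np.percentile(sample_r2, [2.5, 97.5])
      print(f"population R^2 = 0.017; 95% of sample R^2 values fall in [{lo:.3f}, {hi:.3f}]")

    In other words, an effect that small is hard to distinguish from zero at ordinary sample sizes, and the estimate itself bounces around a great deal from study to study.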

  10. Curious as to what people think of this comment on the Bargh et al. (1996) paper from Pubpeer: https://pubpeer.com/publications/E0E81B39485A81B61098BEE8726409 (see below).

    In Experiment 3, the experimenter rated participants on irritability, hostility, anger, and uncooperativeness on 10-point scales. Independently, two blind coders rated the same participants’ facial expressions (from video-tapes) on a 5-point scale of hostility.
    Figure 3 in the paper displays the mean hostility ratings from the experimenter and the blind coders for participants in the two conditions (Caucasian prime and African-American prime). The means from the experimenter and from the blind coders appear almost identical (they are indistinguishable to my eye).
    How likely is it that the experimenter and the blind coders provided identical ratings, using different scales, in both conditions, for ratings of something as subjective as expressed hostility?
    Are the bars shown in Figure 3 an error?

  11. The Doyen et al. “failed replication” was badly designed.

    I am not a priming expert, nor have I ever done a priming study. But I always knew that you mix the priming terms with other terms, as otherwise the manipulation is too obvious and the nonconscious element of priming is lost.

    Doyen et al. did not know this. They have technical excuses, as the specific paper did not have this point (which *I* knew) in writing. So they ignored this central point, and all of Facebook-and-Twitter social science still believes that this was a failure to replicate.

    We learn how badly this form of non-peer review works:
    1) a central part is missing from a study, yet it is hailed everywhere as a “failure to replicate”

    2) Poor John Bargh additionally ranted in a blog post about other points besides the several errors in the study design.
    You guessed it: everyone remembers how bad Bargh’s rant post was. NOBODY remembers that he had very valid points showing the “replication” was not a replication, but a faulty imitation.

    • Jazi:

      The bad news is that even Bargh’s own defense mentions replications that are not replications in that each new paper introduced a new interaction. The other bad news is that Bargh’s study never had a chance: the underlying effects are too small or too variable (take your pick) to be detected in such a small study. So if Bargh’s original study did find a real effect, it was blind luck. And, as I’ve noted many times, had his study found that elderly-priming made people walk faster, that would’ve been equally notable and equally consistent with his general theory of priming.

      It goes like this: these studies can’t lose. An effect in one direction is evidence of the theory. An effect in the other direction is evidence of the theory. And with the garden of forking paths you don’t have to worry about getting no effect. But if for some reason you do get a null effect, you can explain that too as being evidence of an interaction.
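
      As a rough illustration of how easy it is to “win” somewhere, here is a small simulation sketch under a pure null effect. The per-condition sample size and the particular menu of candidate analyses (a main effect plus subgroup tests on two noise “moderators”) are assumptions, not a reconstruction of any actual study:

        # Rough sketch of the garden of forking paths under a pure null effect:
        # a researcher free to report the overall two-sided comparison OR the
        # comparison within either half of either of two (noise) moderators finds
        # "p < .05" somewhere far more often than 5% of the time.
        # The per-condition n and the menu of analyses are assumed for illustration.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(4)
        n, sims = 40, 5000                 # per-condition n (assumed); true effect is zero
        hits = 0

        for _ in range(sims):
            primed = rng.normal(size=n)
            control = rng.normal(size=n)
            pvals = [stats.ttest_ind(primed, control).pvalue]   # main effect, two-sided

            for _mod in range(2):                               # two noise "moderators"
                split_p = rng.random(n) > 0.5                   # high/low group among primed
                split_c = rng.random(n) > 0.5                   # high/low group among controls
                for mp, mc in ((split_p, split_c), (~split_p, ~split_c)):
                    pvals.append(stats.ttest_ind(primed[mp], control[mc]).pvalue)

            hits += int(min(pvals) < 0.05)

        print("share of pure-noise datasets with at least one p < .05:", round(hits / sims, 2))

      And this sketch only allows five ways to win; a real analysis with many outcome measures, covariates, and exclusion rules has far more.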

      • off topic somewhat….

        Regarding “too small” effects. And generally not just here.

        The idea of lab experiments is to maximize an effect, via an optimized design that is supposed to accentuate it.

        I assume of course no foul play and no innumeracy……

        Such lab experiments are by definition sensitive to details as most optimizations/contraptions are.

        I have not looked at the power, etc. Nor did I study the various replications.

  12. Sorry to be late to the party but thought I’d add just a couple comments here.

    1. There’s a double irony that started the long thread by Spencer above. The “seriousness” mentioned in the initial comment by Spencer wasn’t about me taking priming effects seriously. (As Andrew notes, that should have no bearing on the truth value of the claim.) The Perspectives 2014 piece was about taking *replication* seriously, and was in fact a “Whack-a-mole” argument even though I didn’t use that term. Here are some relevant quotes:

    “So how is it that any one researcher could ever find an effect given what appears to be an overwhelming avalanche of potential variability? It is important to be cautious when answering this question because although there are reasonable answers, there are also unreasonable answers that devolve into excuse making for poor research practices by those of us in the field of priming.”

    And, after noting that it is *possible* for researchers to hit upon the right combination of moderating variables to obtain an effect, I wrote:

    “However, there is a danger in this explanation—it may be used to cover up poor research practices and prevent findings that actually are Type I error from ever being corrected. This short circuits the self-correcting nature of science, and it is essential to prevent this from happening. If “researcher degrees of freedom” and questionable research practices (Simmons et al., 2011) are allowing priming researchers to publish findings that are actually Type I error, then it would be a mistake to explain away any subsequent failures and to continue to defend such findings. Along these same lines, priming researchers cannot appeal endlessly to “unknown moderation” without doing the work to provide evidence for such moderation. At some point, the evidence may shift, and it would be more reasonable to conclude that the original effect was wrong or was so specific as to be rendered meaningless. Indeed, the combination of small sample sizes, very large effect sizes, and few direct replications should be a cause for concern.”

    By that time it had become clear to me that there was a more-likely-than-not chance that the field was engaged in a giant game of “Whack-a-mole” and the 2014 paper was in part an attempt to get priming researchers on board with changing research and reporting practices. Kai Jonas and I also published a shorter blog piece at COS (http://osc.centerforopenscience.org/2014/04/09/expectations-1/) arguing the same. Regarding “hidden moderators,” for example:

    “Instead, the correct response is to productively work with other researchers to systematically establish that such moderation did in fact occur! When a failure occurs, the best course forward is to register a replication involving both locations/populations of interest, ideally with an extension of the original study that measures or manipulates the hypothesized moderator… If priming researchers will not do this, then we forfeit the right to continue to talk about this as if it were an established, reliable effect.”

    So at core, I agree with Andrew’s post and the comments critical of our work that follow Spencer’s original comment.

    2. The second irony was Spencer’s defense of the Cesario, Plaks, and Higgins study. As much as I appreciate the very impassioned defense, Steve, it’s unwarranted, both for the reasons described in the 2014 Perspectives piece that you cite (no direct replications, probing of tests for other moderators, etc.) and for the criticisms raised by the other commenters. Most important though, is Andrew’s comment: “Regarding Cesario et al., let me repeat that the general theory (varying effects) was a priori, but the particular analysis was not… it makes the p-values essentially meaningless—but this is a point that until recently was not widely understood.” I couldn’t agree with this more, and the importance of this point wasn’t clear to me at the time we did those studies. In part this is a consequence of having too general of a theory, or if you will, of having a theory that is so underdeveloped that more precise predictions cannot be made. At some level this is a problem of psychology being a younger science dealing with difficult subject matter, but it’s a problem nonetheless. If I predict a 2×2 interaction and I’m interested in the difference between conditions A and B, the theory that generated this prediction may not be precise enough to generate a prediction for what’s going to happen in conditions C and D: It’s silent on what those two conditions will look like. At some level this is fine, but this allows me to claim support if C and D are the same *and* if C and D are different (as well as a bunch of other patterns relating C and D to A and B). And without providing direct replication for the effect, the evidential value is low. As Andrew and others have noted the Garden of Forking Paths problem is subtle and perhaps more problematic than QRPs and other problems that have been raised in recent years. And the CPH paper needs to be understood in light of this.

    3. The one point I would take issue with, Andrew, is your criticism of Bargh’s failure to test moderators in his 1996 paper. I think you’re misunderstanding both the origin of that work and the timeline of the moderator work. Bargh didn’t test moderators because his “theory” at the time *said there should be no moderators.* He derived his predictions for the study based on the theory he had about how priming should influence behavior. That “theory” says there is a direct activation effect of a prime: If you’ve got it stored that the elderly are slow, priming elderly will activate “slow” and directly impact your behavior. From this point of view, there should be no moderators of the effect by things like motivation, individual differences in liking of the elderly, etc. And none of this work had been done at the time he published that initial finding.

    So when you state “The original Bargh, Chen, and Burrows paper from 1996 looked at no interactions at all!…And, if these interactions are so important and so theoretically motivated, why did Bargh, Chen, and Burrows not include these interactions in their own study?” it’s actually pretty clear why they didn’t include those interactions: His “theory” said there shouldn’t be any *and* none of the work claiming to show interactions in this type of priming had yet been done.

    Let me be clear: We’re in agreement that there are a lot of problems throughout! I’m just taking slight issue with the claim that Bargh should have tested for moderators. He was justified in not testing for it because his “theory” said there shouldn’t be moderators. Now, if he then goes on to do priming work without testing moderators *while still claiming the moderator work as supportive,* then one can rightly criticize him. But I don’t think it is justified to criticize his 1996 work in this way.

    4. There is one final thing I take a very slight issue with, Andrew, in your posts (but this is minor and I agree with so much of what you’ve written that I hesitate even bringing it up). I feel as though your posts have included a lot of “us vs. them” framing lately: On one side there’s me and Wagenmakers and Simonsohn, and on the other side there’s Cuddy and Bargh and all the rest. It ignores the huge middle: All the researchers who are motivated to do good science but for whom the last couple of years have been uncharted, turbulent waters and who need an encouraging, welcoming hand to guide them (rather than a condescending middle finger–which is not to say that you personally have done this, only that the us vs. them framing doesn’t help here). In part this is probably just personal preference because I have found myself in this middle. (Although I have many things to be thankful for about my graduate education, unfortunately stats training isn’t one of them.) When those of us in the middle have had to face decisions about how to proceed after learning about these issues that you and others raise, it’s just more productive to have the person raising them be helpful and encouraging. Again, I’m not saying you’re not helpful and encouraging! On the contrary, you’re incredibly helpful and encouraging! Only that the us vs. them framing doesn’t lend itself to this and can be a real turn-off to that middle field. Those people who have been most influential in getting me to change my practices and get better training over the last few years (co-founding the first preregistration-only journal, replicating everything, learning Bayesian stats, putting all data and materials online, larger samples, etc. etc.) were not the people who came out with a religious zeal for their “side” and they’re not the people who are currently publicly shaming people based on their past work. And it bothers me slightly that the ‘us vs. them’ framing might encourage that approach more, which can have the unintended consequence of turning people off. I understand the utility of distinguishing among groups, and I do agree that there’s something wrong with the extreme version when a person learns about these problems but then *doubles down* and continues onward as if nothing is wrong. But between that and you there is a huge mass of people. Again, I hesitate even bringing it up and I might be wrong–perhaps setting camps is more beneficial than harmful. Just some food for thought.

    • I agree with Kyle that this is a wonderful post in a great thread, but my own view is that the public shaming approach sometimes becomes necessary. I have spent the last five years trying to get very influential people in my field to correct their errors – most of them quite clearly willful – and to inform the “middle ground” of what has been going on. Progress has been minimal. Some retractions and expressions of concern have been issued, but the responsible authors continue to engage in the same shenanigans, and (what is worse) young researchers are not only pursuing research based on these highly problematic findings but are modeling their own research and data-analytic practices on the successful behaviors of these “stars”. When authors and journals refuse to act, then a more public approach is surely called for.

  13. I guess some of this could be attributed to the difference between ‘real science’ and ‘social science’. Subtle things matter, and it will always remain preferable (and easier) to present positive findings. It comes back to the trap (and risk) of confirming your own biases.

  14. I stumbled across this thread after reading a book by John Bargh. I found it more interesting than the book and was very impressed by the civility and grace of the conversation. I am curious about why these kinds of results are sold in a deceptive way. Is it lack of statistical expertise or is there a more sinister motivation?

    • Rene:

      There are lots of reasons for overstating results, but perhaps the simplest is what might be called “incumbency.” A result is out there, it’s published, people believe it, and then they feel that some sort of extraordinary argument is needed to knock it down. The trouble is that the publication process is not the ensurer of quality that people would like it to be.
