Authority figures in psychology spread more happy talk, still don’t get the point that much of the published, celebrated, and publicized work in their field is no good

[cat picture]

Susan Fiske, Daniel Schacter, and Shelley Taylor write (link from Retraction Watch):

Psychology is not in crisis, contrary to popular rumor. Every few decades, critics declare a crisis, point out problems, and sometimes motivate solutions. When we were graduate students, psychology was in “crisis,” raising concerns about whether it was scientific enough. Issues of measurement validity, theoretical rigor, and realistic applicability came to the fore. Researchers rose to the challenges, and psychological science soldiered on.

This decade, the crisis implicates biomedical, social, and behavioral sciences alike, and the focus is replicability. First came a few tragic and well-publicized frauds; fortunately, they are rare—though never absent from science conducted by humans—and they were caught. Now the main concern is some well-publicized failures to replicate, including some large-scale efforts to replicate multiple studies, for example in social and cognitive psychology. National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will improve, and researchers will remember their training and learn new standards.

All this is normal science, not crisis. A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects. Of course studies don’t always replicate.

Annual Reviews provides an additional remedy that is also from the annals of normal science: the expert, synthetic review article. As part of the cycle of discovery, novel findings attract interest, programs of research develop, scientists build on the basic finding, and inevitably researchers discover its boundary conditions and limiting mechanisms. Expert reviewers periodically step in, assess the state of the science—including both dead ends and well-established effects—and identify new directions. Crisis or no crisis, the field develops consensus about the most valuable insights. As editors, we are impressed by the patterns of discovery affirmed in every Annual Review article.

On the plus side, I’m glad that Fiske is no longer using the term “terrorist” to describe people who have scientific disagreements with her. That was a bad move on her part. I don’t think she’s ever apologized, but if she stops doing it, that’s a start.

On the minus side, I find the sort of vague self-congratulatory happy talk in the above passage to be contrary to the spirit of scientific inquiry. Check this out:

National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will improve, and researchers will remember their training and learn new standards.

What a touching faith in committees. W. H. Auden and Paul Meehl would be spinning in their graves.

Meanwhile, PPNAS publishes papers on himmicanes, air rage, and “People search for meaning when they approach a new decade in chronological age.” The National Academy of Sciences is supposed to be a serious organization, no? Who’s in charge there?

The plan to rely on the consensus of authority figures seems to me to have the fatal flaw that some of the authority figures endorse junk science. How exactly do Fiske, Schacter, and Taylor plan to “uphold standards” and improve graduate training, when purportedly authoritative sources continue to publish low-quality papers? What do they say if Bem’s ESP experiment makes it into a textbook? Should psychology students do power poses as part of their graduate training? A bit of embodied cognition, anyone?

I really don’t see how they can plan to do better in the future if they refuse to admit any specific failures of the past and present.

Also this:

When we were graduate students, psychology was in “crisis,” raising concerns about whether it was scientific enough. Issues of measurement validity, theoretical rigor, and realistic applicability came to the fore. Researchers rose to the challenges, and psychological science soldiered on.

Yes, it “soldiered on,” through the celebrated work of Bargh, Baumeister, Bem . . . Is that considered a good thing, to soldier on in this way? Some of the messages of “measurement validity, theoretical rigor, and realistic applicability” didn’t seem to get through, even to leaders in the field. It really does seem like a crisis—not just a “crisis”—that all this work got so much respect. And did you hear, people are still wasting their time trying to replicate power pose?

And this:

A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects. Of course studies don’t always replicate.

Yes to this last sentence but no no no no no no no to the sentence before. Or, I should say, not always. To paraphrase a famous psychiatrist, sometimes a replication failure is just a replication failure. The key mistake in the above quote is the assumption that there was necessarily something there in the original study. Remember the time-reversal heuristic: in many of these cases, there’s no reason to believe the original published study has any value at all. Talk of “limiting conditions and contextual effects” is meaningless in the context of phenomena that have never been established in the first place.

If you want to move forward, you have to let go of some things. Not all broken eggs can be made into omelets. Sometimes they’re just rotten and have to be thrown out.

117 thoughts on “Authority figures in psychology spread more happy talk, still don’t get the point that much of the published, celebrated, and publicized work in their field is no good”

  1. Thank you Andrew. One thing this blog has done is made me seriously question the common knowledge that has floated around the last couple decades. Specifically, it seems a lot of the dietary advice that people believe is healthy (eat more vegetables and less red meat, for instance) comes from studies such as the Framingham Heart Study, the Nurses’ Health Study, and the China Study. How do I start evaluating the massive literature on subjects like this? Could it be that people eating salads for lunch are doing the equivalent of a power pose in the mirror?

  2. This might be a good time & place to talk about what to conclude when there is a replication failure. How do you know this means the original study was wrong? How do you know it isn’t the replication that is wrong? I’m not trolling; I think it’s a valid question, and many readers might ask the same question after reading the Annual Reviews bit by Fiske et al.

    Fiske et al seem to imply that replication failure of a published paper means the replication was not done right, or worst case scenario, it did not replicate due to randomness (sometimes they just don’t).

    • You know this through careful analysis of the original study. “Careful analysis” means, among other things: (a) discoveries of effects too large to be true (Kanazawa)/too subject to forking paths (ovulation and dress color)/too dependent on p-values as a rhetorical device (everything)/too noisy to yield results; (b) critical thinking on alternative pathways either considered or not considered in the paper; and (c)-(z)[insert favorite pet peeves here…. I’ve put mine in]

      But really, it’s just being critical of assumptions, methodology, data gathering and all the stuff peer reviewers would do if peer review were an effective check on the publication of wrong results rather than a mere stumbling block for the sloppy and a filter for received wisdom of the in crowd.

    • It would depend on the statistics of replication, I suppose. But at a minimum, a failure to replicate strengthens the evidence on the noise side of noise vs. real effect. And if the failure of a replication were truly caused by a failure to properly duplicate the conditions of the original (as the original researchers sometimes claim), that would be evidence that the effect, even if there is one, is restricted enough that who knows whether it applies to the real world in different settings?

    • Informally, I like Andrew’s time-reversal heuristic. Replications nowadays are usually well-powered, well-defined studies compared to the original research (at least in soc. psych). Replications usually present adequate power calculations, a specific set of instructions, pre-registered analyses… If the well-powered study had come first with an inconclusive result, we wouldn’t give a second, sloppier study the same weight. Of course, all those qualities should be evaluated by careful analysis, as suggested by Jonathan.

      Formally, well, we have a nice theorem in probability theory that tells us how to update our beliefs in light of new evidence (a toy version of such an update is sketched below). Building a model for scientific discoveries is not trivial, though.
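
      Just to make that concrete, here is a toy update. All the numbers are made up: the prior, and how often a well-run replication would fail in each scenario.

        # Toy Bayesian update: how much should one failed, well-powered replication
        # shift belief that the original effect is real? All numbers are invented.
        prior_real = 0.5          # prior probability that the original effect is real
        p_fail_given_real = 0.2   # chance the replication fails anyway if the effect is real
        p_fail_given_null = 0.95  # chance the replication fails if there is no effect

        posterior_real = (p_fail_given_real * prior_real) / (
            p_fail_given_real * prior_real + p_fail_given_null * (1 - prior_real)
        )
        print(round(posterior_real, 2))  # ~0.17: belief drops a lot, but not to zero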

    • I think we should borrow from predictive analytics (a subset of data mining).

      To provide a brief stereotype of the field, you take part of your data (the training set) and beat it all to heck. Transform variables, drop cases, impute tons of missing data, change the variable you are predicting, etc. But when you do this, you KNOW you are very likely to be overfitting. So, you have a completely clean data set to replicate on. If it doesn’t replicate, you try again. (A minimal sketch of this train/holdout idea appears at the end of this comment.)

      It seems to me very difficult to convince researchers not to go down the garden of forking paths (etc). This is due to publication pressure and basic human frailty, sure, but also an honest belief that they’ve learned from the data in the course of the study, and “this is the way it really is, not how I thought it was a year ago when I planned the study.” But without replication, how do we know?

      If we take Occam’s razor to this problem, I’d answer Garrett M’s question by saying that if it doesn’t replicate, the original finding is likely trivial enough to ignore.
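
      A minimal sketch of that train/holdout idea, on synthetic data. The data set and model here are made up; the point is only that all the flexible fitting happens on one split, and the honest check happens on data that was never touched.

        # Do all the flexible, forking-paths modeling on the training split,
        # then treat the untouched holdout as the "replication." Data are synthetic.
        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 20))            # synthetic predictors
        y = X[:, 0] + rng.normal(size=1000) > 0    # outcome driven by one real signal plus noise

        X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)

        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # "beat the training set to heck" here
        print("train accuracy:", round(model.score(X_train, y_train), 2))
        print("holdout accuracy:", round(model.score(X_hold, y_hold), 2))  # the honest check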

      • Zbicyclist:

        I think what you’re saying makes sense in many fields, but:

        1. In economics, political science, and other historical fields, you often can’t replicate. We can’t get data on 20 more recessions or 20 more wars; we have to make decisions on what we have.

        2. With hierarchical Bayesian modeling, it should be possible to analyze all the forking paths (or at least a model of such). As Jennifer, Masanao, and I discuss in our article, hierarchical modeling resolves the fundamental multiple comparisons problem. (A toy partial-pooling sketch follows below.)
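
        Not the model from our article, just a toy version of the partial-pooling idea, with invented group estimates and a between-group scale that is fixed by hand rather than estimated:

          # Partial pooling by hand: each noisy group estimate is shrunk toward the
          # overall mean, with more shrinkage for noisier groups. Numbers are invented.
          import numpy as np

          y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])        # raw group estimates
          sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])  # their standard errors
          tau = 5.0                                                  # assumed between-group sd (fixed here)

          mu = np.average(y, weights=1 / (sigma**2 + tau**2))        # precision-weighted grand mean
          shrinkage = sigma**2 / (sigma**2 + tau**2)                 # 1 = pull fully to mu, 0 = keep raw estimate
          theta = shrinkage * mu + (1 - shrinkage) * y               # partially pooled estimates
          print(np.round(theta, 1))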

  3. Don’t worry, Andrew, the respect anyone is feeling about these “authority figures” is rapidly diminishing in the field. If you talk to the younger graduate students, most of them are aware that they are joining a field with pooh-bahs whose own work is in many cases probably unreplicable and replete with shoddy statistics. There is very little or no funding being given to these fields at this point. The band is still playing and the MCs are trying to smile, but the charade is winding down.

  4. “And this:

    A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects. Of course studies don’t always replicate.

    Yes to this last sentence but no no no no no no no to the sentence before”

    Thank you for pointing this out!

    The APS Observer article by Susan Goldin-Meadow seems to try to make a point similar to that of Fiske et al.:

    “There is currently an effort to raise the status of replication in experimental studies and devote some of our precious journal space to making sure a phenomenon is robust across labs (e.g., Nosek & Lakens, 2014). These efforts seem reasonable to me as long as they do not become exercises in fault-finding but are seen as what they are — ways to test the robustness and generality of a phenomenon. Barring intentional fraud, every finding is an accurate description of the sample on which it was run. The question — an important one — is whether the findings extend beyond the sample and its particular experimental conditions.”

    (http://www.psychologicalscience.org/observer/preregistration-replication-and-nonexperimental-studies).

    • I would argue that Goldin-Meadow is not denying the problem like Fiske et al. Goldin-Meadow is just cautioning people to be respectful in the process and recognize that even if a finding is found non-replicable, there’s no reason to be throwing around blame. There are plenty of findings for which that is precisely the case. In contrast, Fiske et al. are acting like every published effect is important because… published… because… science!

      • Psyoskeptic, I think you are not quite skeptical enough. “Barring intentional fraud, every finding is an accurate description of the sample on which it was run. The question — an important one — is whether the findings extend beyond the sample and its particular experimental conditions.”

        This statement is manifestly false. See anything Simonsohn/Nelson/Simmons/John/Schimmack have written about p-hacking and questionable research practices; see anything Gelman has written on “The Garden of Forking Paths.” Every finding is NOT necessarily an accurate description of the sample on which it was run. The statement is at best deeply ignorant and naive; given the publicity about these issues, it is plausibly interpretable as “willfully blind.” Otherwise, “Listening to the Beatles causes people to become younger” would be “an accurate description of the sample on which it was run” (Simmons et al., 2011, False Positive Psychology).

  5. Regarding the last couple of paragraphs of this, one of the most frustrating things for me is the failure of senior figures in psychology to at least consider that a replication failure might indicate that the original finding was a false positive. Given all the discussion of the ways that widespread methodological and statistical practices can raise the rate of false positives (through the garden of forking paths) it seems perverse to continue to stick to the line that replication failures indicate only the limiting conditions and context sensitivity of the original findings.

    Also, I think it is possibly a mistake to assume that outright fraud is rare because only a few cases have been uncovered. Those who have uncovered these cases have written about how incredibly difficult and costly fraud is to detect with certainty.

  6. “Now the main concern is some well-publicized failures to replicate”

    No, that is a symptom. The main concern is that their method of research still hasn’t resulted in a theory predicting anything more precise than “A is positively correlated with B” (because it is incapable of leading to cumulative understanding). Since there are no precise theoretical predictions to distinguish between, there is no need to collect accurate data.

    This type of research only persists by using the completely debunked metric of statistical significance as a false criterion of productivity. Even the (worthless) standards of NHST aren’t followed, though: “p-hacking” is everywhere! Meehl put it right in one of his video lectures[1]: it is just a sham.

    [1] http://meehl.umn.edu/recordings/philosophical-psychology-1989

        • yes, sorry, I meant papers. I was recently reading Ioannidis’ papers, and I was amazed that he just keeps rehashing standard textbook stuff again and again in paper after paper. Basically he is so famous that he can just keep writing the same thing over and over again.

        • I see Ioannidis as a dedicated communicator of things that need to be communicated to a wide audience rather than as someone claiming to be producing new research.

        • I agree with Martha’s take. The question is who else is coming up with new ideas? I think it is much harder because the biomedical fields and social sciences do not recruit the most creative candidates to graduate schools; mostly they recruit those who are analytical.

        • And what needs to be communicated? NHST isn’t misused; it is worthless. The paper I would write (if I were more ambitious) would be composed, in part, of suggestions about how to present data (using hypothetical or real data) visually. I’m sure that something like this exists already though – maybe even in mainstream psychology. I seem to remember someone named Tufte?

        • Shravan:

          Martha has a point about communicating versus developing, but I do think there is a general problem here – not just with any given author – with the process by which someone or some group gets seen as the guiding light and all the resources get thrown their way.

          Behind this, I think, is that the career selection that leads to being seen as a guiding light favors those who game the system.

          Given that Ioannidis is seen as one of the guiding lights to fix the reproducibility crisis, one worries what percentage of the guiding lights (who are hoped to fix the reproducibility crisis) are doing largely what they are hoped to undo.

    • Thank you for the support; however, I’m not a professional “methodological terrorist”. I am interested in developing software that can detect mathematical inconsistencies, and in ways to automate checking papers, but I don’t usually try to see what errors I can uncover in someone’s work.

      The pizza papers just happened to fall into my lap because of his blog post, and they were hard to ignore.

      I hope that others start to use granularity testing to check work that they think could be questionable.

  7. This sentence…

    “A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects.”

    … shows a complete lack of knowledge of philosophy and of statistics (sampling error). How can you call yourself a scientist if you see truth in this sentence?

    I use sentences like this in my statistics and methodology courses to demonstrate the crisis in science. That many findings do not replicate is bad, but the real crisis is scientists seeing truth in sentences like the one above. :-(

    • Seems reasonable that many (most? I don’t know) failed replications point to false positives (especially where the replication is methodologically very similar to the original and higher powered), but others result from contextual factors that influence when an effect obtains and when it doesn’t. E.g.,
      http://m.pnas.org/content/113/23/6454.abstract

      Of course it’s often practically quite difficult to sort out the cause of a given failed replication, but I think it’s important to try because these two different scenarios suggest very different paths forward for subsequent research in a given vein.

        • I don’t think the first of these responses is particularly convincing. Inbar’s is stronger, though as he correctly notes in his conclusion, it is hard to partial out sub-discipline from contextual sensitivity of the phenomenon. Of course, we would all agree that contextual moderation of effects is a real thing. The question is how often this explains failed replications in the social sciences.

        • What I find most interesting about possible “contextual moderation” in this case is that it is supposed to have an effect here on the replication success of so-called “direct” replications, i.e., what I would view as the optimal circumstances for trying to corroborate the original findings.

          If that doesn’t even seem to work, then what does that tell us about the ability of (social) psychological science to meaningfully explain and predict human behaviour?

          Whenever I hear the words “hidden moderator” or “contextual sensitivity” I am reminded of Meehl (1967):

          http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf

          “It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.” (p. 114)

          If I were a social scientist, I would want to work on topics/findings that have a bit more backbone and are robust to large changes in the experiment, but that’s just me. I just couldn’t be bothered anymore with an effect if it isn’t even being successfully replicated in a “direct” replication, where there are minimal changes in the experiment.

        • -Effect size is important and underrated
          -Contextual moderation is real
          -You’d be hard-pressed to find many social dynamics that aren’t subject to some sort of moderation

    • Thomas:

      I followed your link, which is a discussion of a book called How Professors Think. My problem with that whole endeavor (the book, not your discussion), or at least with the title, is that “professor” is such a broad category, encompassing me and Susan Fiske and Uri Simonsohn and tenured professors of social work and adjuncts in all sorts of fields.

      • I agree, Andrew. Maybe I should have said “how professors think when they’re on a panel to review research proposals”. That’s what Lamont claims to know something about. But you’re right. Her book, I finally decided, should have been called “How one Ivy League professor thinks other professors at institutions within driving distance of her university think, based on what they told her they were thinking about when they were in a room with other professors trying to assess the quality of other professors’ thinking. (Spoiler: she thinks they think much the way she does.)”

  8. Cf. “Die Trying”. PS — Where you got those other forthcoming titles: Lee Child bribed you. Apparently Michael Connelly and James Lee Burke did not. Hardly surprising when one thinks about it…

  9. Andrew: I really like what you write (I actually tweeted this post) but ironically writing about the “celebrated work of… Baumeister” is a bit unfair, I think. Yes, his work on ego depletion is questionable, but he has written many other influential and solid papers about other stuff.
    I’d like to see more care when throwing around names like that.

    Ed

  10. Fiske et al’s latest contribution is astounding. I imagine this is the “never concede any point” style of argumentation that many people, even scientists, adopt. The assumption always is, I am right and you are wrong. Faced with the fact that experiments are non-replicable, they come up with tepid explanations in order to protect their position. We need more Carneys and fewer Fiskes. I guess one day Carney will be a senior and respected scientist (maybe already is; I certainly have enormous respect for her), so I am optimistic like Fiske, but for different reasons.

    • Fiske is digging her own academic grave; she will be remembered as one of the most prominent reactionary scientists in psychology, holding back the important reforms we need to make.

    • Shravan:

      You write that Fiske is following a “never concede any point” style of argumentation (of the sort that I like to mock by posting that image of the Sopranos crew in front of the pork shop).

      The funny thing is, though, I doubt that Fiske et al. see this article of theirs in that way. My guess is that they see it as an eminently reasonable, statesman-like, split-the-difference sort of document. And, rhetorically, the document looks moderate. The words are very soothing:

      Psychology is not in crisis . . . Every few decades . . . Researchers rose to the challenges, and psychological science soldiered on. . . . National panels will convene . . . Graduate training will improve . . . normal science, not crisis . . . not a scientific problem . . . the cycle of discovery . . . Expert reviewers periodically step in . . . the field develops consensus . . . we are impressed by the patterns of discovery affirmed in every Annual Review article.

      I’d feel more confident in the last bit if I didn’t have a sinking feeling that Susan Fiske is also impressed by the patterns of discovery affirmed in every article she’s greenlighted for PPNAS.

      To return to my point: in its content, the article by Fiske, Schacter, and Taylor is a stubborn refusal to admit error, truly a non-scientific attitude. But in its music, it is pleasant and reassuring. No methodological terrorists, just a positive message about the future. (Indeed, Fiske made a major tactical error with that “methodological terrorism” article last year, as it removed her chance to shoot back at critics like myself by bemoaning a lack of civility. Once you liken your opponents to terrorists, you can hardly complain when they respond with mockery and indignation.)

      I agree with you entirely that the views expressed by Fiske, Schacter, and Taylor are weak. And, as a scientist, I find them embarrassing. But many people (including, I suspect, Fiske et al.) read more for the sound than the content. I’ve noticed this a lot, that people say things or write things that make no sense but just kinda sound good.

      Or sometimes they don’t even sound good—remember that quote, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%”? That’s a ridiculous claim, and it even sounds stupid. I think those dudes were in an echo chamber.

      • One further point: for scientists like Fiske statistics is just a legitimizing tool to tell the story they wanted to tell anyway. Stats makes them look serious and quant-like. That is one reason why they don’t care what you say.

        • Shravan:

          I think you’re right but I don’t think this is how Fiske sees it. My guess is that she sees herself as a classical Popperian, building theories, making predictions, and testing them with experiment. I assume she holds the standard view that if p is not less than .05, you can’t feel certain of your results. And I assume she legitimately would be happy to abandon her theories if, in her view, the data rejected them.

          But she doesn’t seem to know about or understand Type M and Type S errors, or researcher degrees of freedom, or the garden of forking paths. It’s becoming harder and harder to ignore these issues, but she’s doing her best.

          Also, via the usual approach to null hypothesis significance testing, Fiske never tests her theories, she only tests the “null hypothesis” which is the antithesis of her theories. So she can reject the null (this counts as support for her theory) or not reject it (in which case her theory still stands). Testing her theories is not on the agenda. But I doubt she’s thought things through to this point. There are only 24 hours in the day, after all!

        • Not sure what Fiske, Schacter, and Taylor themselves think. However, my guess is that many talented but misguided researchers view p-values like we view speed limits on highways.

          Imagine someone who honestly thinks, “If we find p < 0.05 in our study, then the effect should be considered to exist. If p < 0.05 in the original study, failures to replicate this effect should almost always be considered evidence that the originally reported effect is context-dependent, rather than wrong."

          How could they think this in an honest way? Maybe they do see the most obvious problems with the rule, "If p<0.05, the effect should be considered real." However, maybe they also view this rule as an imperfect and arbitrary, but simple criterion that society has agreed is better than leaving the choice to the discretion of the individual. Like speed limits.

          This would explain something that, at face value, seems startlingly contradictory — how these researchers can both (A) think p-values are great to use and also (B) fudge and p-hack a lot!

          Personally, I think that both (A) speed limits are necessary and (B) it's fine to speed sometimes. That isn't a contradiction — if people can't be trusted to choose their own driving speeds, then it is better to have an arbitrary, simple, enforceable rule and break it sometimes than to have no rule at all. So, if these researchers view p-values like speed limits, then they could be in favor of using p<0.05 to decide whether effects should be considered true, while also thinking it's OK to fudge and p-hack sometimes. From their internal perspective, there may not be any real contradiction.

          I see p<0.05 as a deeply flawed tool that was designed to get at truth, not a speed limit. But perversely, for someone who already views p<0.05 as a speed limit, the more flaws are pointed out with p<0.05, the more convinced they may be that p<0.05 is a speed limit — instead of a tool for obtaining truth — and the more inclined they might be to cheat and p-hack as a result. You couldn’t possibly convince me speed limits should be removed altogether by pointing out more and more of their inconsistencies. You’d just make me think, “Maybe I should speed more.”

          If I’m right that misguided researchers view p-values like we view speed limits, how does one stop this? You could try to be the police, and apply a complex standard that is not based on the p-value specifically. Ahem, Andrew. But that requires a lot of policing. Maybe it’s better to argue specifically that p-values aren’t speed limits. That instead, they are tools for getting at truth — really bad, overused ones – and that you should be using better tools.

        • “And I assume she legitimately would be happy to abandon her theories if, in her view, the data rejected them.”

          Anyone out there who has read her 300+ papers? Has she ever conceded that one of her theories is wrong given the new evidence? I’m guessing the answer is no.

          Her algorithm guarantees non-convergence to the truth. When you run an experiment, you will either get evidence for or against your favored theory. If you get evidence for, you are done. If you get evidence against, you are back in a loop of looking for new explanations for why it didn’t pan out. This is also how many psycholinguists, who believe in a particular position or are professionally invested in one, think. I know very few people who would speak out against their own pet theories, or actively look for evidence against them. It is easier to just suppress data that doesn’t match the theory and keep doing deep data-dives till something bears fruit. This is essentially what Fiske et al also advocate.

        • “My guess is that she sees herself as a classical Popperian, building theories, making predictions, and testing them with experiment.”

          Sort of ironically, it brings to mind an anecdote Popper used to recount from his youth in Vienna:

          “As for Adler, I was much impressed by a personal experience. Once, in 1919, I reported to him a case which to me did not seem particularly Adlerian, but which he found no difficulty in analyzing in terms of his theory of inferiority feelings, although he had not even seen the child. Slightly shocked, I asked him how he could be so sure. “Because of my thousandfold experience,” he replied; whereupon I could not help saying: “And with this new case, I suppose, your experience has become thousand-and-one-fold.”

      • Shravan points to a style of argument and Andrew points to how Fiske may be thinking. I would aver that the style of argument derives directly from a protective stance. Fiske pivots from specifics to a general analogy, which is a red flag of lack of confidence in winning the argument on the merits. “We thought the world was ending then and now you know it was not” is really just a flat denial in the guise of an argument.
        Notice that Andrew’s time-reversal heuristic keeps the attention on the actual studies at hand, thinking about specific results in a different, interesting, and useful way. Fiske’s time reversal is more general, distracting from the actual studies, which should be the basis for any analytical discussion. “We thought the world as we know it was ending – today we know it was not – therefore, the world as we know it can never end.” Which is specious on its face. Applying time reversal to that, one could ask: in the 1980s, would you have thought it possible that psychology could ever be in crisis in the future? In 30 years? Ever? If not, why not? This shift from the specifics to the general by analogy has intuitive appeal and is deceptively “soothing,” but it is not analytical, and it discards the data (the specific studies), which makes this method of argument contrary to science.

        • ““We thought the world was ending then and now you know it was not” is really just a flat denial in the guise of an argument.”

          I see the statement “We thought the world was ending then and now you know it was not” as a flat denial, but not in the guise of an argument.

    • No offence to academia, but to say “All this is normal science, not crisis. A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects. Of course studies don’t always replicate.” is just nonsense. I recently wrote a piece for our Clinical College Journal in which I referenced a paper where the test statistics were just wrong. The errors meant that the authors possibly concluded, incorrectly, that depression does not lower performance on a test of memory designed to detect “faking”, for want of a better term. Their inaccuracy meant that a clinician, therefore, could not be clear whether the presence of low mood might indicate that people were faking or whether that is irrelevant. It was the ‘go to’ paper. Of course I am sure the N=24 and N=26 in the groups had nothing to do with it :( Rather than “limiting conditions and contextual effects”, the research around this ‘measure’ trundled onward until clinicians became so uncomfortable with the test that best practice was defined as using an alternative. Academic research did not check context and academic research did not find limiting conditions. In the meantime, no doubt many people had this test presented and then, potentially, used as evidence upon which to make decisions about their insurance cover following injuries. The very real impact of unscientific research, with errors that could have been avoided, seems not to enter consideration in the embarrassingly fluffy polemics above. Contrast this with the Vioxx scandal, where the perpetrators of science that had yet to be “contextualised” etc. were held accountable, not so much for the fraud but for the impact on the public. Alternative facts, people!

      • Re “A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects”:

        This seems like just a minor variant of something I remember hearing (in the seventies or eighties?), that mathematics had a bad image because we talked about “problems”, so we should change our vocabulary to be more positive by reframing “problems” as “opportunities.”

    • Shravan: Speaking of Susan Fiske, she was on a panel at the recent SPSP (Society for Personality and Social Psychology) annual conference. She stated there that she does not do social media herself but relies on what people tell her about it. That means that her “methodological terrorism / self-appointed data police” essay was based on gossip, hearsay, or whatever you want to call it. Also, Fiske continued to insist that people’s (meaning social psychologists’, I guess) careers have been ruined by social-media criticism of their research. I cannot think of any social psychologist whose career has been ruined in this way. People like Bargh, Baumeister, Schwarz, Gilbert, Schnall, Dweck, etc., are not shrinking violets; rightly or wrongly, they fight back. The one possibility is Amy Cuddy, who was Fiske’s graduate student. An article on the web says that she turned down the tenure offer from Harvard in order to bring her message to a wider audience. Perhaps that isn’t true; perhaps criticism of her power-pose research did result in denial of tenure. Who knows?

  11. I honestly don’t understand how to distinguish between cases where failure to replicate is evidence that effects reported in the original study are (A) wrong or (B) contextual. I’d appreciate it if anyone could point me to a few best reads on this, especially any describing formal frameworks for distinguishing. I’d guess that would be in the causal inference literature, but I don’t know.

    Here’s an example of my confusion:
    Suppose a US study finds obesity is associated with a rare cancer, then a larger Japanese study reports no apparent association. Is this evidence that the US study was a fluke? Or are the different standards of obesity in Japan responsible?
    Imagine another study applies US obesity standards to Japanese data, again finding no apparent association. Do we then conclude the US study was a fluke? Or could the difference be explained by different observed confounder values? Or different unobserved confounders?

    • Anon:

      In many of these studies (for example, the ovulation-and-voting study, which we used as an example), the original study is so noisy that it provides essentially zero information about the phenomenon in question. So if the claim from the original study is correct, it’s by accident or because the researchers happened to know the correct direction and so steered the analysis that way. In such studies, there’s really nothing to replicate cos there’s nothing there in the first place. I think power pose and those Wansink studies are the same sort of thing.

      • Andrew: Thanks for the explanation. For studies that aren’t so flawed, are there formal methods for distinguishing between “wrong” and “contextual” findings based on replications?

        • What incentive would we have to do that? It’s much easier to ignore the possibility of moderation and instead assume malice or incompetence on the part of the researcher.

        • Psy: From your name, you seem to be in psychological research. I am not. Suppose methods were developed to conclude, “This failed replication is unlikely to be due to differences in the observed characteristics of participants in the original study and the replication,” or, “This failed replication seems attributable to differences in the observed characteristics of participants in the original study and the replication.”

          If such methods were available, do you think researchers faced with a failed replication would be more likely to concede that the original study could simply be wrong? Or do you think they would mostly double down by arguing that the failed replication is due to unobserved differences from the original study?

        • I think that would be tremendously valuable and people would really appreciate it. Though there are other reasons a replication study might yield different results. Cultural or historical change can affect results, especially for studies of political and cultural topics. Study protocol differences as well.

    • You are thinking about replication in the wrong way (“is there an association or not?”). Just to start you off (and not that I agree with all of it):

      In fact, the traditional methods based on significance testing make it impossible to reach correct conclusions about the meaning of these studies. That is what is meant by the statement that traditional data analysis methods militate against the development of cumulative knowledge.
      […]
      As we have seen, traditional reliance on statistical significance testing leads to the false appearance of conflicting and internally contradictory literatures. This has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society.

      Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.

      • Anoneuoid: Thanks for the link! I didn’t intend to imply that the important question is “is there an association or not”. Sorry I was unclear. I only meant to say that there is a meaningful association in the US study (for example, a moderate effect size estimate with confidence intervals excluding 0) and no meaningful association in the Japanese study (for example, a very small effect size estimate with confidence intervals tight around zero).

        Have you seen articles on any formal framework that could be applied to distinguish whether the Japanese study should reduce confidence in the association reported in the US study, or instead be viewed as evidence that the association reported in the US study is contextual?

        • First, that paper I linked to was not about statistical vs. practical significance; please check it out, because I can see you are still suffering from the same misunderstandings as before I shared it. Second, the answer is no. It is not possible for the type of study you described. It is a poorly designed study, designed for NHST purposes rather than scientific ones.

          Instead, the US researchers should have collected data (probably a time series; for cancer, age-specific incidence is the best measure, in my opinion), described it, and tried to come up with a model that explains the observations. This model needs to be one from which a precise prediction can be deduced. I would start with the Armitage-Doll model (a toy version of that kind of fit is sketched at the end of this comment).

          Then these predictions should be fit to the Japanese data, which have different model parameters (e.g., the percent fat content of the diet should increase division rates by x amount, or suppress the probability that a cancerous cell is detected by the immune system by y%, or whatever). If the model doesn’t fit the Japanese data, it should be modified or outright rejected (depending on what alternative explanations are available).

          Also, what you describe is not really a replication. A replication study should have been performed in circumstances as similar to the original as possible (i.e., not in Japan). Finally, if no one could come up with a model for the original US data that could make predictions, then no conclusions should be drawn until someone does come up with predictions, or until the results fail to replicate under very similar circumstances (in which case there is no reason to “hang our hats” on the observations).
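
          To make the mechanics concrete, here is a toy fit of the textbook Armitage-Doll power-law form (incidence roughly proportional to age to the power k-1). The incidence numbers are fabricated purely for illustration; this is not a real analysis.

            # Fit incidence = a * age**(k-1) by a straight-line fit on the log scale.
            # The age-specific incidence values below are made up.
            import numpy as np

            ages = np.array([40., 45., 50., 55., 60., 65., 70.])
            incidence = np.array([8., 14., 24., 38., 60., 90., 130.])  # cases per 100,000 per year (invented)

            slope, intercept = np.polyfit(np.log(ages), np.log(incidence), 1)
            k_hat = slope + 1         # implied number of stages
            a_hat = np.exp(intercept)
            print(f"implied stages k ~ {k_hat:.1f}, scale a ~ {a_hat:.2g}")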

        • Anoneuoid: As I read it, the article you linked criticizes significance testing, and recommends point estimates and confidence intervals for meta-analysis. Sounds good!

          I’m not sure why you conclude the US and Japan studies are unscientific. I made them up as innocuous examples of a study and its replication. If you’d like to pick any other study that both looks good and has had a failed replication, that would be fine.

          In that case, my question is, “Are there good articles on formal methods of determining the following: Should a failed replication be viewed as suggesting (A) the actual effect is not what was reported in the original study or (B) the actual effect differs between the context of the original study and the replication?”

          You suggest that mathematical, mechanistic models should be used, to allow the model fit on one study to be tested on data from the other. I agree this approach works in some fields, like physics. To my understanding, though, replications in many other fields do not and will not involve the kind of detailed mechanistic modeling of the variety you describe. In epidemiology and medicine, at least, this is usually infeasible because so much is biologically unknown and there are many free parameters.

          One can also transport findings from one population to another without detailed mechanistic models. A simple example is taking age-specific incidence from one population and applying it to the age distribution of a second population, in order to estimate overall incidence in the second population (a small numerical version of this is sketched at the end of this comment). A more complicated example is the research on causal transportability by Bareinboim and Pearl.

          Although replication is such a hot topic, I haven’t seen research on transportability methods as a means of investigating how much of a failed replication is due to (A) conflict between the original and replicated findings and (B) different distributions of characteristics in the original study and the replication (a context difference). I’m not sure if this hasn’t been done yet, if it’s impossible, if it’s impractical, or if I’m a really poor googler. So I would really like to see any related articles.
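
          To make that simple transport example concrete, the calculation I have in mind is just this (all rates and population weights are invented):

            # Direct standardization: apply age-specific rates estimated in population A
            # to the age distribution of population B to predict B's overall incidence.
            import numpy as np

            rates_A = np.array([20., 60., 150.])      # cases per 100,000/year in bands 40-54, 55-69, 70+ (invented)
            age_dist_B = np.array([0.5, 0.35, 0.15])  # share of population B in each band (sums to 1)

            expected_B = np.sum(rates_A * age_dist_B)
            print(f"expected overall incidence in B: {expected_B:.0f} per 100,000 per year")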

        • “my question is, “Are there good articles on formal methods of determining the following: Should a failed replication be viewed as suggesting (A) the actual effect is not what was reported in the original study or (B) the actual effect differs between the context of the original study and the replication?””

          I’d say no — there are too many factors involved. We (I originally wrote “one”, but one person is likely not enough) need to examine each study carefully to see how sound it is. And we always need to remember that there is always an element of randomness (and hence an element of uncertainty) in any sample.

          However, Andrew’s “time-reversal heuristic” does suggest that replications are likely to be more carefully done than the work they attempt to replicate, so when in doubt, tentatively give more credence to the replication, assuming it has no clear flaws in design or implementation.

          “I’m not sure why you conclude the US and Japan studies are unscientific. I made them up as innocuous examples of a study and its replication.”

          Because they are designed to look for an effect, rather than collect detailed enough data to figure out what is going on.

          “You suggest that mathematical, mechanistic models should be used, to allow the model fit on one study to be tested on data from the other. I agree this approach works in some fields, like physics.”

          It works there because physicists have a long history of running studies that go beyond “is there an effect”.

          “In epidemiology and medicine, at least, this is usually infeasible because so much is biologically unknown and there are many free parameters.”

          There are such models in bio, but they get almost no attention. This is because the vast majority of the funding has been going to people who design studies that merely “look for an effect”, rather than develop the models or collect data that narrows the plausible range of model parameters. If physics starts designing studies around NHST, you will see progress grind to a halt there (and possibly reverse) as well.

          “how much of a failed replication is due to (A) conflict between the original and replicated findings and (B) different distributions of characteristics in the original study and replication (a context difference).”

          The usual way to do this is to incorporate the relevant characteristics into a mechanistic model. Then your model makes a prediction, and you compare that to the data.

  12. Non-psychologists may not realize this but the passage under discussion is the introduction to the Annual Review of Psychology – supposedly a collection of “go to” articles for anyone wishing to get up to speed in an area of psychology. So, a lot of graduate students will read this.

    I find it interesting that the same authors wrote a slightly different introduction to the 2015 issue of the Annual Review of Psychology which seems a little ironic looking back.
    http://www.annualreviews.org/doi/10.1146/annurev-ps-66-120414-100001

    Here is the 2015 text in full:

    Science communication is our mission at the Annual Review of Psychology. Our credibility in this enterprise requires both expertise and trustworthiness. Expertise is guaranteed by our distinguished editorial board and their selection of outstanding authors and cutting-edge topics, which assures solid science. The Annual Review of Psychology’s academic citation rates suggest that the articles succeed in earning the respect of our colleagues. The speed with which invited authors say yes also suggests that an Annual Reviews article carries weight.

    Besides being authoritative, credibility requires being trustworthy, in this case being a fair judge of the balance of the evidence. We strive for reviews that go beyond self-interest, so authors are urged to cover their own work as only a part of the larger literature. We aim for reviews that fairly summarize the findings, even if critiquing them. We ask that authors reveal conflicts of interest, if any, so readers can judge their objectivity. Our dedicated copyediting makes articles as reader friendly as possible. Our graphics aim to be both appealing and accurate.

    In all this, our audiences are both scientific colleagues and the educated public. Annual Reviews’ mission is to serve as an honest broker of scientific information for both audiences. We hope to be science communicators with worthy intentions as well as authoritative knowledge.

    • “Expertise is guaranteed by our distinguished editorial board and their selection of outstanding authors and cutting-edge topics, which assures solid science.” Ouch! Sounds like too much PR.

  13. Nailed it again, Andrew. The Old Guard won’t go quietly into the night, but they will go… H

    Vinay Prasad, assistant professor of medicine at Oregon Health & Science University, quoted in Freakonomics, Bad Medicine, Part I:
    http://freakonomics.com/podcast/bad-medicine-part-1-story-98-6/
    “The reality was that what we were practicing was something called eminence-based medicine.” Fiske, Schacter & Taylor present a thinly-veiled defense of eminence-based psychology…

    Then there is this, from Edwards & Roy, 2017 (journal: Environmental Engineering Science):
    http://online.liebertpub.com/doi/full/10.1089/ees.2016.0223. From the abstract:

    “If a critical mass of scientists become untrustworthy, a tipping point is possible in which the scientific enterprise itself becomes inherently corrupt and public trust is lost, risking a new dark age with devastating consequences to humanity.”

    I do think they had a revealing Freudian slip, though. They wrote this:
    “Annual Reviews provides an additional remedy that is also from the annals of normal science: the expert, synthetic review article.”

    Dictionary.com has several definitions of “synthetic.” The first two are about chemical synthesis. The third and fourth have something to do with language.

    The fifth is this: “Not real or genuine; artificial; feigned.”

    “Normal science” indeed…

  14. “When we were graduate students, psychology was in “crisis,” raising concerns about whether it was scientific enough. Issues of measurement validity, theoretical rigor, and realistic applicability came to the fore. Researchers rose to the challenges, and psychological science soldiered on.”

    OK, what was the big change in response to this crisis? This is the field I did my doctoral work in, and I don’t remember any big paradigm shift. (Given our ages, Fiske and I were likely in doctoral programs at the same time.)

    • You want a big paradigm shift? Amazon’s Mechanical Turk?!

      Hey, that’s big! Crap can scale now. And on a shoestring budget too.

      Really a shot in the arm for the cottage industry of crappy psych results.

  15. My cartoon understanding of Multiverse theory is that there are alternative dimensions/universes out there where infinitely many other copies of Fiske et al are playing out, and all possible variations are unfolding. From one such alternative dimension comes this lightly edited text of Fiske et al’s text. There is a Fiske out there who thinks this way (although her name might be Carney or something like that). It’s just an accessible possible world for us. (In the text below, the words this alternate Fiske struck out of the original are shown in square brackets.)

    Psychology is [not] in crisis, [contrary to] consistent with popular rumor. Every few decades, critics declare a crisis, point out problems, and sometimes motivate solutions. When we were graduate students, psychology was, as today, in [“crisis,”] crisis, raising concerns about whether it was scientific enough. Issues of measurement validity, theoretical rigor, and realistic applicability came to the fore. Unfortunately, researchers [Researchers rose to] ignored the challenges, and psychological science blithely soldiered on.

    This decade, the crisis implicates biomedical, social, and behavioral sciences alike, and the focus is replicability. First came a few tragic and well-publicized frauds; fortunately, they are rare—though never absent from science conducted by humans—and they were caught. Now the main concern is some well-publicized failures to replicate, including some large-scale efforts to replicate multiple studies, for example in social and cognitive psychology. National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will hopefully improve, and researchers will [remember] review their less than adequate training and learn new standards.

    All this is [normal science, not] just crisis. A replication failure is [not] a scientific problem; but it is also an opportunity to [find limiting conditions and contextual effects] improve. Of course studies don’t always replicate; with such sloppy methods, what else did you expect?

    Annual Reviews provides an additional remedy that is also from the annals of normal science: the expert, [synthetic] critical review article written by statisticians. As part of the cycle of discovery, novel findings attract interest, programs of research develop, scientists build on the basic finding, and inevitably researchers discover [its boundary conditions and limiting mechanisms] that most of it is noise-mining and deep-data diving. Expert [reviewers] statisticians periodically step in, assess the state of the science—including both dead ends and well-established effects—and identify new directions. Crisis or no crisis, the field [develops] is unlikely to develop consensus about the most valuable insights unless it takes a cold, hard look at itself. As editors, we are impressed by the [patterns of discovery] critical attitude affirmed in every Annual Review article.

  16. Waiting for Stan to run, I looked at Princeton’s graduate stats requirements. It seems it’s a single course: Quantitative Analysis in Psychological Research (PSY 503).

    https://psych.princeton.edu/program-requirements

    This is also about what I had in linguistics at Ohio State. It’s just not enough. There should be a one-and-a-half year sequence of three courses, at the very least, and taught by statisticians who know something about psych and can communicate with humans.

  17. Andrew: The Fiske, Schacter, and Taylor Annual Review intro that you give above is for 2016. Have you seen the one for 2017?

    http://www.annualreviews.org/doi/full/10.1146/annurev-ps-68-120216-100001

    It begins:

    “Wisdom does not often appear in rapid-fire social media posts. Wisdom takes time and thought, the accumulation of knowledge informed by evidence and experience, and the crafting of careful prose ….”

    This is first-authored by someone who has admitted publicly that she does not do social media, so how does she know that “wisdom does not often appear”?

    • Carol:

      It’s worse than that. They actually write this:

      Our articles are carefully reviewed by devoted colleagues and by the editors. So, take the time to experience some curated wisdom from hand-picked experts. It’s time well spent.

      The same experts who vetted himmicanes and air rage, huh?

      I think my time would be better spent watching some Game of Thrones.

    • She took 8 posts at random from Twitter, decided 7 of them were nonsense, and rejected the null that most rapid-fire social media posts contain wisdom, using a one-sided binomial test of the null that half of all rapid-fire social media posts have wisdom. p=0.035156 (Those last decimal places make it scientific.)
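
      (For what it’s worth, the arithmetic on that made-up test does check out:)

        # One-sided binomial test of the (joke) null that half of rapid-fire posts
        # contain wisdom: probability of 7 or more "nonsense" posts out of 8 at p = 0.5.
        from scipy.stats import binom

        print(binom.sf(6, 8, 0.5))  # P(X >= 7) = 0.03515625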

  18. As a social psychologist, I think what I find most disappointing is that all these esteemed names in the field are either unwilling or unable to apply the knowledge of their field to themselves. We who write ad nauseam about every type of “cognitive bias” known to humankind, as well as our blindness to having them, can’t acknowledge that they may be affecting our response to criticism? Normal, perhaps, yes. But as psychologists, we should hold ourselves to a higher standard.

  19. 1.) Ban null hypothesis significance testing.

    2.) Promote so-called single-subject designs where appropriate (much of psychology and medicine).

    3.) Where SSDs cannot be used, promote the notion that there are no magic values that can replace what a p-value is alleged to be (i.e., an arbiter of whether or not there was an effect).

    Cordially,
    Glen
