It comes down to reality and it’s fine with me cause I’ve let it slide

E. J. Wagenmakers pointed me to this recent article by Roy Baumeister, who writes:

Patience and diligence may be rewarded, but competence may matter less than in the past. Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure. Flair, intuition, and related skills matter much less with n = 50.

In fact, one effect of the replication crisis can even be seen as rewarding incompetence. These days, many journals make a point of publishing replication studies, especially failures to replicate. The intent is no doubt a valuable corrective, so as to expose conclusions that were published but have not held up.

But in that process, we have created a career niche for bad experimenters. This is an underappreciated fact about the current push for publishing failed replications. I submit that some experimenters are incompetent. In the past their careers would have stalled and failed. But today, a broadly incompetent experimenter can amass a series of impressive publications simply by failing to replicate other work and thereby publishing a series of papers that will achieve little beyond undermining our field’s ability to claim that it has accomplished anything.

I [Baumeister] mentioned the rise in rigor corresponding to the decline in interest value and influence of personality psychology. Crudely put, shifting the dominant conceptual paradigm from Freudian psychoanalytic theory to Big Five research has reduced the chances of being wrong but palpably increased the fact of being boring. In making that transition, personality psychology became more accurate but less broadly interesting.

Poe’s Law, as I’m sure you’re aware, “is an Internet adage which states that, without a clear indicator of the author’s intent, parodies of extreme views will be mistaken by some readers or viewers for sincere expressions of the parodied views.”

Baumeister’s article is what might be called a reverse-Poe, in that it’s evidently sincere, yet its contents are parodic.

Just to explain briefly:

1. The goal of science is not to reward “flair, intuition, and related skills”; it is to learn about reality.

2. I’m skeptical of the claim that “today, a broadly incompetent experimenter can amass a series of impressive publications simply by failing to replicate other work.” I’d be interested in who this experimenter is who had this impressive career.

In fact, the incentives go in the other direction. Let’s take an example. Carney et al. do a little experiment on power pose and, with the help of some sloppy data analysis, get “p less than .05,” statistical significance, publication, NPR, and a Ted Talk. Ranehill et al. do a larger, careful replication study and find the claims of Carney et al. to be unsupported by the data. A look back at the original paper of Carney et al. reveals serious problems with the original study, to the extent that, as Uri Simonsohn put it, that study never had a chance.

So, who’s the “broadly incompetent experimenter”? The people who did things wrong and claimed success by finding patterns in noise? Or the people who did things carefully and found nothing? I say the former. And they’re the ones who were “amassing a series of impressive publications.”

Baumeister’s problem, I think, is the same one as the problem with the “statistical power” literature, which is that he sees “p less than .05,” statistical significance, publication, NPR, Ted Talk, Gladwell, Freakonomics, etc., as a win. Whereas, to me, all of that is a win if there’s really a discovery there, but it’s a loss if it’s just a tale spun out of noise.

Here’s another example: When Weakliem and I showed why Kanazawa’s conclusions regarding beauty and sex ratio were essentially unsupported by data, this indeed “undermined psychology’s ability to claim that it has accomplished anything” in that particular area—but it was a scientific plus to undermine this, just as it was a scientific plus when chemists abandoned alchemy, when geographers abandoned the search for Atlantis, when biologists abandoned the search for the Loch Ness monster, and when mathematicians abandoned the search for solutions to the equation x^n + y^n = z^n for positive integers x, y, z and integers n greater than 2.

3. And then there’s that delicious phrase, “more accurate but less broadly interesting.”

I guess the question is, interesting to whom? Daryl Bem claimed that Cornell students had ESP abilities. If true, this would indeed be interesting, given that it would cause us to overturn so much of what we thought we understood about the world. On the other hand, if false, it’s pretty damn boring, just one more case of a foolish person believing something he wants to believe.

Same with himmicanes, power pose, ovulation and voting, alchemy, Atlantis, and all the rest.

The unimaginative hack might find it “less broadly interesting” to have to abandon beliefs in ghosts, unicorns, ESP, and the correlation between beauty and sex ratio. For the scientists among us, on the other hand, reality is what’s interesting and the bullshit breakthroughs-of-the-week are what’s boring.

Anyway, I read through that article when E. J. sent it to me, and I started to blog it, but then I thought, why give any attention to the ignorant ramblings of some obscure professor in some journal I’d never heard of.

But then someone else pointed me to this post by John Sakaluk who described Baumeister as “a HUGE Somebody.” It’s funny how someone can be HUGE in one field and unheard-of outside of it. Anyway, now I’ve heard of him!

P.S. In comments, Ulrich Schimmack points to this discussion. One thing I find particularly irritating about Baumeister, as well as about some other people in the anti-replication camp, is their superficially humanistic stance, the idea that they care about creativity! and discovery!, not like those heartless bean-counting statisticians.

As I wrote above, to the extent that phenomena such as power pose, embodied cognition, ego depletion, ESP, ovulation and clothing, beauty and sex ratio, Bigfoot, Atlantis, unicorns, etc., are real, then sure, they’re exciting discoveries! A horse-like creature with a big horn coming out of its head—cool, right? But, to the extent that these are errors, nothing more than the spurious discovery of patterns from random noise . . . then they’re just really “boring” (in the words of Baumeister) stories, low-grade fiction. The true humanist, I think, would want to learn truths about humanity. That’s a lot more interesting than playing games with random numbers.

40 thoughts on “It comes down to reality and it’s fine with me cause I’ve let it slide”

  1. While I agree with the thrust of Andrew’s comments, I think we need to be cautious about one thing. We start with a published study that showed a “significant” result and attempt to replicate it. If we are motivated by the desire to show that it doesn’t replicate, it is easy enough to make that happen by using the sloppiest measurements we can get away with. By adding enough noise to our data, we can wash out whatever effect might actually be there. This is the same reason that the FDA was long unaccepting of, and is still skeptical of, non-inferiority trials for drugs: if you design the study badly enough, you can almost guarantee your new drug will appear to be “non inferior” to the comparator.

    The point is that a replication study can be counted as an honest replication only if its methods were at least as rigorous as those of the original study. For the examples that Andrew cites, that’s a very low bar, indeed. But we do need to be alert to the possibility; a small simulation sketch below illustrates how sloppy measurement can bury a real effect.
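
    A minimal simulation sketch of that point, using made-up numbers (the effect size, sample sizes, and noise levels here are hypothetical, not taken from any study discussed on this page): as measurement noise is added to the outcome, the standardized effect shrinks and a genuinely real effect “fails to replicate” more and more often.

```python
# Hypothetical illustration: how added measurement noise lowers the power of a
# two-sample t-test, so a sloppy "replication" can miss a real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.5     # true mean difference between groups (hypothetical)
n_per_group = 50      # sample size per group in the replication (hypothetical)
n_sims = 2000         # number of simulated replication studies

def power(measurement_sd):
    """Share of simulated studies reaching p < 0.05 at a given noise level."""
    hits = 0
    for _ in range(n_sims):
        noise_c = rng.normal(0.0, measurement_sd, n_per_group)
        noise_t = rng.normal(0.0, measurement_sd, n_per_group)
        control = rng.normal(0.0, 1.0, n_per_group) + noise_c
        treated = rng.normal(true_effect, 1.0, n_per_group) + noise_t
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            hits += 1
    return hits / n_sims

for sd in [0.0, 1.0, 2.0, 4.0]:
    print(f"measurement noise sd = {sd}: power is about {power(sd):.2f}")
```

    Under these made-up settings the power falls from roughly 0.7 with clean measurement to well under 0.1 once the added noise dwarfs the true variation, which is the sense in which a badly measured study is almost guaranteed to “fail to replicate.”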

    • Similar to what you say, a replication is only a direct replication if the new study uses the same methods as the previous study. Once you start changing the measures, testing paradigm, etc, you’re into the world of conceptual replications, which (should) carry less weight. Some people have suggested that it’s important to include the original author in a direct replication attempt in part so that the methods can be matched as well as possible.

      Short version: if someone is running a direct replication as many people would suggest, the methods should be only as sloppy as the original study’s, and there should be no concern on that front about a non-significant result.

      • Conceptual replication should carry a lot of weight if we care about learning something about the world. If a result is so brittle that it evaporates under tiny perturbations of the experimental setup, how much of a discovery was it? Moreover, do you actually believe it would replicate with the exact experimental conditions when it didn’t for very similar conditions?

        When the media cover a discovery, with fame, fortune and Ted talks accruing to the discoverers, they do so presuming something general was found. Showing no such thing was found with a conceptual replication is a good contribution to science.

        A failure to conceptually replicate tells us a lot about the initial study. While some uncertainty (there is always some) may remain about whether there would be an effect under the exact original conditions, we learn that there isn’t one under similar conditions, which greatly reduces the scope and value of the initial discovery.

        • Agreed, but why run a conceptual replication if you don’t think, or aren’t sure, that the original result actually holds? Run a direct replication or two, feel confident that you’re looking at something real, and then explore the boundaries.

        • Suppose that when standing within a small box painted on my deck at 2pm on the solstice you can see a rainbow of colors beaming at you that fills you with good feelings… we’ll call this “rainbow power pose”. Now, I’ll give a TED talk and discuss the importance of rainbow power pose for the well-being of society. I’ll say that if you stand facing east at 2pm with your arms spread wide, you will receive the electromagnetic rays of the sun in such a way as to enhance the production of serotonin in your body, which will lead to better family relations, an improved career outlook, and weight loss.

          Now you try to conceptually replicate this by going out on your deck in Stuttgart, Germany, on the solstice, facing east at 2pm. Note that you’ve even taken care to do it on the solstice, which wasn’t actually reported in the TED talk… but it’s not actually on my deck, inside the white painted box….

          You find that mainly you get a sunburn but no rainbow power pose….

          You could fly all the way to Los Angeles to try to reproduce this effect, at a cost of, say, $2000 for airplane tickets and a hotel room, or we could agree that whatever rainbow power pose might actually occur wouldn’t be all that helpful to society. So, should we replicate my rainbow power pose first, or is it ok to just try to replicate this conceptually in the broader world, where your neighbor hasn’t hung a shiny quartz crystal in their kitchen window? The general concept of rainbow power pose doesn’t actually involve neighbors with quartz crystals, and certainly the TED talk doesn’t mention the neighbor.

          The point is, conceptual replication failures are strong indicators that even if something’s going on, it probably doesn’t matter to society; and even if it does matter to society, the conceptual explanation for why it occurs is most likely bunk, and the original researcher should go back to the drawing board and figure out what’s really going on. If it fails a conceptual replication, I think the onus is on the original researcher to elucidate their theory, define it better, and replicate it in a broader setting… there are too many potentially bogus and highly limited claims to have independent people check them all out in careful direct replications.

        • I don’t disagree with you broadly, but it is also some people’s opinion (e.g., Pashler’s work referenced in http://statmodeling.stat.columbia.edu/2014/09/03/disagree-alan-turing-daniel-kahneman-regarding-strength-statistical-evidence/) that conceptual replications have contributed to the current replication crisis by allowing researchers to move the goalposts.

          But to play devil’s advocate: let’s say that between your TED talk and the trips to Germany or LA I sent a few hundred people to your deck and they largely confirmed the rainbow power pose. That would be interesting even after the Germany/LA data, no? Perhaps you’d start to make some money off of your limited-validity deck.

        • Is this what professors get paid the big bucks for? Finding out that when 17-to-22-year-old college students stand in their research labs inside a white box, they are able to adopt a rainbow power pose that increases the circumference of their arms and thereby makes the males slightly more politically conservative while the females are induced to wear redder shirts, but only when the weather is not too cold and they are ovulating (p < .05, not 0.05, because it's impossible to completely disprove a statistical hypothesis with a single sample?)

        • The way conceptual replications are used is so that no one is ever wrong. Then a variety of conflicting results are generated that do nothing but allow people to waste each other’s time and money. Imagine a series of studies like this:

          The first finds the drug-treated group had slower cancer growth in the case of 6-month-old male rats. Then a replication is done to see if the drug also works for 3-month-old female rats. It turns out it “doesn’t work” (it is speculated this is due to some hormonal thing), but then another study reports that the hormone levels have no correlation in the case of 7-month-old female rats. In this case the drug does seem to “work” as long as the dose is doubled from the earlier studies. However, the original authors report again on the topic that for 3-week-old rats the drug does indeed work at the original dosage, but you have to measure cancer growth using surface area rather than circumference of the tumor (as was done for the earlier studies, including theirs).

          The problem is that all of this information is not at all verified, and it isn’t worth coming up with any real theory to explain it. There is nothing of value that can be produced if no one is verifying anything; it is a total waste of time.

  2. “It’s funny how someone can be HUGE in one field and unheard-of outside of it.”

    It’s tough — you only get heard of outside your field thanks to NPR, TED talks, etc. So maybe that ought to be (or is) the norm: HUGE in one field and unheard-of outside of it.

  3. Of course. Yet… Does it make sense to keep the cantankerous, representative parties walled off from the judge – as the truth machine of science publication claims it must?

  4. Andrew, I am – as you know – a huge skeptic of the Carney, Cuddy, and Yap power pose article. The paper is a complete mess. But a close look at the Ranehill paper also reveals some pretty substantial problems. They too did not eliminate the gambling confound (gambling influences testosterone and other hormones), although they tried to reduce it by not informing participants about whether they had lost or won on the gambling task. The Ranehill data are also characterized by absolutely insane changes in testosterone and cortisol levels for some participants over what was probably a 40-minute period. For example, the testosterone score of one woman went from 17 to 118 while that of another changed from 90 to 15. One man’s testosterone level went from 117 to 11 while another changed from 13 to 75. Similar dramatic changes were also found for cortisol for some participants. Something was clearly wrong with their hormone measurement, and while I don’t for a moment believe that the claimed power pose effects are real, the Ranehill paper does seem a little odd to me.

    As regards Baumeister – he is (or was) a big deal but most of his work has been viewed with extreme skepticism among my circle of acquaintances in psychology for a very long time. I know plenty of researchers who tried to replicate his work and failed but all those studies were buried in a file drawer somewhere. One of them apparently contacted Baumeister to let him know that she could not replicate an effect he had described. He allegedly told her to “keep trying until it works” – or something to that effect.

    • Hi Mark and everyone,

      Thanks for raising this issue and looking at the data (that is why we posted it online :) ). In the hormone literature, observations are sometimes considered “outliers” if they are more than 3 standard deviations away from the sample average. We looked at whether there were any outliers according to this definition in our data, for men and women separately, for the first hormone measurement of testosterone (T) and cortisol (C), for the second measurement of T and C, and for the change in T and C. With the 3-standard-deviation cutoff, we did not find any outliers. But, as you noted to us in a private email after you had looked at the data, excluding the observations about which you were skeptical does not appear to have any impact on the inferences we can draw from our data. Importantly, we have fairly high correlations between the initial and the final hormonal measurements (correlation between T1 and T2 = 0.8645, and between C1 and C2 = 0.6128). In an unpublished data set with 186 men, compiled by one of the authors, the corresponding numbers are 0.7262 and 0.5654, so the correlations don’t seem to be unusually low. (A small code sketch of this kind of screening appears after this thread.)

      Whether gambling actually influences hormones, on the other hand, is another open question – there is some published research that suggests it does, but more work, in which we do not find evidence of this, is being written up right now.

      Best,

      Eva Ranehill et al

      • Hi everyone,
        just a quick correction to my post above. I realized I had a typo in my code when classifying outliers. We do have a few outliers (2% or less for each measurement), but, as you noted in your email, Mark, we also find that excluding these observations doesn’t change the results qualitatively.
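
      A minimal sketch of the screening described in the comment above, assuming a hypothetical data frame with columns sex, T1, T2, C1, and C2 (testosterone and cortisol at the two measurement times); the column names, grouping, and 3-standard-deviation rule follow the description in this thread, not the actual Ranehill et al. analysis code.

```python
# Hypothetical sketch: flag observations more than 3 SD from the sample mean,
# separately by sex and by hormone measurement, and compute the test-retest
# correlations between the first and second measurements.
import pandas as pd

def flag_outliers(x: pd.Series) -> pd.Series:
    """True where an observation is more than 3 standard deviations from the mean."""
    return (x - x.mean()).abs() > 3 * x.std()

def screen(df: pd.DataFrame) -> None:
    for sex, group in df.groupby("sex"):
        for measure in ["T1", "T2", "C1", "C2"]:
            print(f"{sex}, {measure}: {flag_outliers(group[measure]).sum()} outlier(s)")
        for hormone in ["T", "C"]:
            change = group[f"{hormone}2"] - group[f"{hormone}1"]
            print(f"{sex}, change in {hormone}: {flag_outliers(change).sum()} outlier(s)")
    # test-retest correlations between the first and second measurements
    print("corr(T1, T2) =", round(df["T1"].corr(df["T2"]), 3))
    print("corr(C1, C2) =", round(df["C1"].corr(df["C2"]), 3))
```

      The substantive check described in the thread is then to rerun the main analysis with and without any flagged observations and see whether the inferences change.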

  5. Roy Baumeister’s work on ego-depletion has generated a huge amount of evidence, but a meta-analysis that corrects for bias finds nothing. A pre-registered replication study found nothing. In response, Roy Baumeister was unable to name a single paradigm that could produce the basic effect, yet he has amassed over 100,000 citations. Fame != Great Science

    https://replicationindex.wordpress.com/2016/04/18/is-replicability-report-ego-depletionreplicability-report-of-165-ego-depletion-articles/

  6. Another concern often voiced by the “skeptics of the need for and value of replication” is that careers will be harmed by excessively zealous attempts to verify published findings. What that concern neglects, of course, is the careers that were already harmed by the publication of incorrect truth claims. Those would be the careers of the researchers who were excluded from the high-profile journals the original papers appeared in, and therefore didn’t get or retain the fancy jobs that the original researchers got or retained as a result of their unreplicable work. (Never mind the harm done to truth, etc.) So the concern with the lives and careers of the “targets of replication attempts” seems superficial. As you have noted, if the experiments were done in the reverse order, the situation would be completely different.

    I think “less broadly interesting” is intended to mean “less interesting to most people.” The claim that a Power Pose will bring you wealth and fame is more “broadly interesting” than the claim that it won’t do very much.

    I feel like you are misinterpreting Baumeister’s message. Baumeister is NOT saying people shouldn’t do replications (or publish them); he’s pointing out that a system of (publish bad studies and then try to replicate them and publish those results too) is bad. He is not saying it would be better to (publish bad studies and DON’T try to replicate them), which is what I think you think he’s saying.

    • Phil:

      It is my impression that Baumeister is saying that studies like power pose etc. are not bad studies, he’s saying they’re good studies because they’re interesting. My point is that they’re interesting only if true. If the claims are in error, those studies are boring (except in the sociological sense that it’s interesting that bad work can get published and promoted).

      • Not only are they interesting only if true, they are actually dangerous if false. People following inert or harmful advice are not following good advice, so there are opportunity costs and possibly direct costs.

      • I would say you could add a modifier to “interesting”: these studies are “potentially” interesting, meaning they would (or even will) be interesting if or when proven true. I’m trying to distinguish two things: that which is demonstrated in some way, and which thus generates further work because truth has consequences; and that which might be true (or false), and which thus generates further work to see whether it is. One can then argue that these additional studies – low-powered and noisy as they are – are further work that fails to prove truth and which, taken as a whole, becomes a substantial disproof, even though undertaken for other reasons, including the need to publish something and intellectual carelessness. Somewhat tongue in cheek: these crappy studies may amount to the kind of substantial exploration of negative cases that science often doesn’t recognize in its search for positive results, so that the constant publishing of results which qualify as “positive” but which really aren’t actually has value, by amounting to the disproof over time of all such nonsense. Is there a way to insert a smiley face here other than :)?

    • The article is paywalled and I’m not willing to pay $36 to read it. Andrew, if you’ve read it and Baumeister says somewhere in it that he prefers false but “broadly interesting” findings to true but less interesting ones then I am wrong about this. But in what you’ve posted I don’t see even a whiff of a suggestion that Baumeister thinks the way you say he thinks. My impression is that Baumeister thinks “personality psychology” just isn’t a very interesting field, and that results that make it seem interesting are likely to be false. I doubt he thinks that’s a good thing. But it’s hard to judge from just the bits you posted.

      • Phil:

        Baumeister seems to like his own theories which, on the balance of the evidence, seem to me to be false. So, yes, I think he likes at least one set of false theories. What he doesn’t like is to know they are false.

        To return to the Loch Ness monster example, Baumeister is holding in his hands a blurry photo of Nessie, and he is really really annoyed by people who want to take the magic out of the world by going up to Loch Ness to find out if anything’s really there. He’d rather remain in a state of blissful credulity than confront the data that suggest his theories are wrong.

  8. Your list of fabulous things that come to the perpetrators of this sort of “research” should not have stopped at TED talk and Gladwell. You missed out “high-six-figure book deal” (and, in at least two cases that I am aware of, “seven-figure book deal”). You also missed “Influence over public policy, potentially costing taxpayers hundreds of millions of dollars”. This stuff (a) does real damage and (b) makes certain people seriously wealthy.

    • Wait! You’re saying that it’s not okay if you simply say that you “never tried to oversell your findings” while simultaneously going on a book and speaking tour touting your discredited effects to the public and US Government?

  9. Aren’t “Flair, intuition, and related skills” supposed to be the requirements for a successful artist?

    Isn’t science supposed to be “boring” in nature because it aims to be bound as tightly as possible to reality?

    Is Baumeister suggesting that science should be replaced with art and sales pitches?

    Has that already begun to happen?

      • I heard from one of Sir Richard Peto’s (https://www.ndph.ox.ac.uk/team/richard-peto) colleagues that he advised people to think of their academic career as just being a theatrical career.

        I actually think that was very good advice as no matter how good your work is, if it does not get fully aired, it will have little impact.

        On the other hand, his refusal to give talks on or discuss his infamous O-E method for meta-analysis at conferences/workshops does suggest that past performances that were reviewed positively by many but very negatively by some tend not to be repeated.

        • From the (translated) book by social psychologist Diederik Stapel (p. 90):

          “I was successful, and I got applause. That made me want more. After all, I loved the theater. The actor in me was reawakened and I started to set up my classes so that by the end, the students in “the house” would be so under the influence of the performance that after the last, carefully-planned silence from the stage, all they would be able to do was break out in emotional applause.”

          source: https://errorstatistics.files.wordpress.com/2014/12/fakingscience-20141214.pdf

  10. The second point you make — who is this person that has this career? — is so spot on I just had to comment. I recently published a finding that was critical of a study conducted some years ago. It took 6 years for us to publish the critical paper because of all the blockades to publishing replications. Fun fact: the original study had 9 subjects and our study has over 2000.

  11. “a significant result with n = 10 often required having an intuitive flair…Flair, intuition, and related skills matter much less with n = 50.”

    Pshhhh. I remember the good old days when we had to find our “significant effects” with n=3. Today’s researchers are so damn coddled!

    [God, so many things wrong with his message: e.g., “More data is bad!”]

  12. His comments seem to me like a variety of other public controversies where we will never reach the truth, so you have people fighting over which type of wrongness is better. You can think of it like the “how many guilty men would you allow to roam free in order to ensure that one innocent man isn’t imprisoned?” type of question.

    The problem here, as in so many other cases, is scale. Sure, one of the dangers of encouraging replication is that there will inevitably be high-profile failed replications that are the result of the experimenter lacking the proper skillset to run the experiment well (and either not having the knowledge or the ethics to recognize it). p = 1 that this will happen!

    But a general reluctance to replicate or publish replications will lead to more bad findings taken as truth than the possibility of bad replications causing similarly distorted views. And the peer review process is designed to help root out those who make poor choices about methods, but that same process is less suited to passing judgments on whether a finding was anomalous.

    Last of all, an errant finding—a Type 1 error—can have the effect of causing certainty about something when it isn’t called for (power poses do [something good]). A replication that makes a Type 2 error, on the other hand, should merely raise doubt about something about which we once had certainty. The Type 1 error is likely to cause the state of research on that specific question to go from 0% positive findings to 100% positive findings (the same would be the case if it were a true positive). Smart people should know enough to be skeptical on such a flimsy basis, true or false positive, but I think psychologically the absence of contradictory evidence probably makes for more confidence than is warranted. If the Type 2 error publication is allowed to exist, you go to 50% positive findings. If you’re talking about an underlying truth that was missed in the replication, then the body of work puts you closer to believing it than had there been no publications. But if the original article was a false positive, and there is no corrective issued, then your sole mechanism to keep us from having a false impression is to hope the error was clear to peer reviewers. It’s either that or you hope for “conceptual replications,” about which the respective authors will complain that the findings don’t match because the others’ methods were different and wrong.
