The time-reversal heuristic—a new way to think about a published finding that is followed up by a large, preregistered replication (in context of claims about power pose)

[Note to busy readers: If you’re sick of power pose, there’s still something of general interest in this post; scroll down to the section on the time-reversal heuristic. I really like that idea.]

Someone pointed me to this discussion on Facebook in which Amy Cuddy expresses displeasure with my recent criticism (with Kaiser Fung) of her claims regarding the “power pose” research of Cuddy, Carney, and Yap (see also this post from yesterday). Here’s Cuddy:

This is sickening and, ironically, such an extreme overreach. First, we *published* a response, in Psych Science, to the Ranehill et al conceptual (not direct) replication, which varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable (and note that even so, somehow people missed that they STILL replicated the effects on feelings of power). So yes, I did respond to the peer-reviewed paper. The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture. I’ve been quiet and polite long enough.

There’s a difference between having your ideas challenged in a constructive way, which is how it used to be in academia, and attacked in a destructive way. My “popularity” is not relevant. I’m tired of being bullied, and yes, that’s what it is. If you could see what goes on behind the scenes, you’d be sickened.

I will respond here but first let me get a couple things out of the way:

1. Just about nobody likes to be criticized. As Kaiser and I noted in our article, Cuddy’s been getting lots of positive press but she’s had some serious criticisms too, and not just from us. Most notably, Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto Weber published a paper last year in which they tried and failed to replicate the results of Cuddy, Carney, and Yap, concluding “we found no significant effect of power posing on hormonal levels or in any of the three behavioral tasks.” Shortly after, the respected psychology researchers Joe Simmons and Uri Simonsohn published on their blog an evaluation and literature review, writing that “either power-posing overall has no effect, or the effect is too small for the existing samples to have meaningfully studied it” and concluding:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.

OK, so I get this. You work hard on your research, you find something statistically significant, you get it published in a top journal, you want to draw a line under it and move on. For outsiders to go and question your claim . . . that would be like someone arguing a call in last year’s Super Bowl. The game’s over, man! Time to move on.

So I see how Cuddy can find this criticism frustrating, especially given her success with the Ted talk, the CBS story, the book publication, and so forth.

2. Cuddy writes, “If you could see what goes on behind the scenes, you’d be sickened.” That might be so. I have no idea what goes on behind the scenes.

OK, now on to the discussion

The short story to me is that Cuddy, Carney, and Yap found statistical significance in a small-sample, non-preregistered study with a flexible hypothesis (that is, a scientific hypothesis that posture could affect performance, which can map onto many, many different data patterns). We already know to watch out for such claims, and in this case a large follow-up study by an outside team did not find a positive effect. Meanwhile, Simmons and Simonsohn analyzed some of the published literature on power pose and found it to be consistent with no effect.

At this point, a natural conclusion is that the existing study by Cuddy et al. was too noisy to reveal much of anything about whatever effects there might be of posture on performance.

This is not the only conclusion one might draw, though. Cuddy draws a different conclusion, which is that her study did find a real effect and that the replication by Ranehill et al. was done under different, less favorable conditions, for which the effect disappeared.

This could be. As Kaiser and I wrote, “This is not to say that the power pose effect can’t be real. It could be real and it could go in either direction.” We question on statistical grounds the strength of the evidence offered by Cuddy et al. And there is also the question of whether a lab result in this area, if it were real, would generalize to the real world.

What frustrates me is that Cuddy in all her responses doesn’t seem to even consider the possibility that the statistically significant pattern they found might mean nothing at all, that it might be an artifact of a noisy sample. It’s happened before: remember Daryl Bem? Remember Satoshi Kanazawa? Remember the ovulation-and-voting researchers? The embodied cognition experiment? The 50 shades of gray? It happens all the time! How can Cuddy be so sure it hasn’t happened to her? I’d say this even before the unsuccessful replication from Ranehill et al.

Response to some specific points

“Sickening,” huh? So, according to Cuddy, her publication is so strong it’s worth a book and promotion in NYT, NPR, CBS, TED, etc. But Ranehill et al.’s paper, that somehow has a lower status, I guess because it was published later? So it’s “sickening” for us to express doubt about Cuddy’s claim, but not “sickening” for her to question the relevance of the work by Ranehill et al.? And Simmons and Simonsohn’s blog, that’s no good because it’s a blog, not a peer reviewed publication. Where does this put Daryl Bem’s work on ESP or that “bible code” paper from a couple decades ago? Maybe we shouldn’t be criticizing them, either?

It’s not clear to me how Simmons, Simonsohn, and I are “bullying” Cuddy. Is it bullying to say that we aren’t convinced by her paper? Are Ranehill, Dreber, etc. “bullying” her too, by reporting a non-replication? Or is that not bullying because it’s in a peer-reviewed journal?

When a published researcher such as Cuddy equates “I don’t believe your claims” with “bullying,” that to me is a problem. And, yes, the popularity of Cuddy’s work is indeed relevant. There’s lots of shaky research that gets published every year and we don’t have time to look into all of it. But when something is so popular and is promoted so heavily, then, yes, it’s worth a look.

Also, Cuddy writes that “somehow people missed that they STILL replicated the effects on feelings of power.” But people did not miss this at all! Here’s Simmons and Simonsohn:

In the replication, power posing affected self-reported power (the manipulation check), but did not impact behavior or hormonal levels. The key point of the TED Talk, that power poses “can significantly change the outcomes of your life”, was not supported.

In any case, it’s amusing that someone who’s based an entire book on an experiment that was not successfully replicated is writing about “extreme overreach.” As I’ve written several times now, I’m open to the possibility that power pose works, but skepticism seems to me to be eminently reasonable, given the evidence currently available.

In the meantime, no, I don’t think that referring to a non-peer-reviewed blog is “the worst form of scientific overreach.” I plan to continue to read and refer to the blog of Simonsohn and his colleagues. I think they do careful work. I don’t agree with everything they write—but, then again, I don’t agree with everything that is published in Psychological Science, either. Simonsohn et al. explain their reasoning carefully and they give their sources.

I have no interest in getting into a fight with Amy Cuddy. She’s making a scientific claim and I don’t think the evidence is as strong as she’s claiming. I’m also interested in how certain media outlets take her claims on faith. That’s all. Nothing sickening, no extreme overreach, just a claim on my part that, once again, a researcher is being misled by the process in which statistical significance, followed by publication in a major journal, is taken as an assurance of truth.

The time-reversal heuristic

One helpful (I think) way to think about this episode is to turn things around. Suppose the Ranehill et al. experiment, with its null finding, had come first. A large study finding no effect. And then Cuddy et al. had run a replication under slightly different conditions with a much smaller sample size and found statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any power-pose effect is fragile.

From this point of view, what Cuddy et al.’s research has going for it is that (a) they found statistical significance, (b) their paper was published in a peer-reviewed journal, and (c) their paper came before, rather than after, the Ranehill et al. paper. I don’t find these pieces of evidence very persuasive. (a) Statistical significance doesn’t mean much in the absence of preregistration or something like it, (b) lots of mistakes get published in peer-reviewed journals, to the extent that the phrase “Psychological Science” has become a bit of a punch line, and (c) I don’t see why we should take Cuddy et al. as the starting point in our discussion, just because it was published first.
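To put a rough number on point (a), here is a minimal simulation sketch (the group size and number of outcome measures below are hypothetical, not taken from any of the studies discussed): with several outcomes and no preregistered analysis plan, a study in which every true effect is exactly zero will still turn up at least one p < .05 comparison about a quarter of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def any_significant(n_per_group=21, n_outcomes=5, n_sims=10_000, alpha=0.05):
    """Share of all-null studies that report at least one p < alpha outcome."""
    hits = 0
    for _ in range(n_sims):
        for _ in range(n_outcomes):
            a = rng.normal(size=n_per_group)  # "treatment" group, true effect = 0
            b = rng.normal(size=n_per_group)  # "control" group, true effect = 0
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits += 1
                break
    return hits / n_sims

# With 5 independent outcomes and no true effects, roughly
# 1 - 0.95**5 ≈ 23% of studies still turn up a "significant" comparison.
print(any_significant())
```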

What next?

I don’t see any of this changing Cuddy’s mind. And I have no idea what Carney and Yap think of all this; they’re coauthors of the original paper but don’t seem to have come up much in the subsequent discussion. I certainly don’t think of Cuddy as any more of an authority on this topic than are Eva Ranehill, Anna Dreber, etc.

And I’m guessing it would take a lot to shake the certainty expressed on the matter by team TED. But maybe people will think twice when the next such study makes its way through the publicity mill?

And, for those of you who can’t get enough of power pose, I just learned that the journal Comprehensive Results in Social Psychology, “the preregistration-only journal for social psychology,” will be having a special issue devoted to replications of power pose! Publication is expected in fall 2016. So you can expect some more blogging on this topic in a few months.

The potential power of self-help

What about the customers of power pose, the people who might buy Amy Cuddy’s book, follow its advice, and change their life? Maybe Cuddy’s advice is just fine, in which case I hope it helps lots of people. It’s perfectly reasonable to give solid, useful advice without any direct empirical backing. I give advice all the time without there being any scientific study behind it. I recommend writing this way, and teaching that way, and making this and that sort of graph, typically basing my advice on nothing but a bunch of stories. I’m not the best one to judge whether Cuddy’s advice will be useful for its intended audience. But if it is, that’s great and I wish her book every success. The advice could be useful in any case. Even if power pose has null or even negative effects, the net effect of all the advice in the book, informed by Cuddy’s experiences teaching business students and so forth, could be positive.

As I wrote in a comment in yesterday’s thread, consider a slightly different claim: Before an interview you should act confident; you should fold in upon yourself and be coiled and powerful; you should be secure about yourself and be ready to spring into action. It would be easy to imagine an alternative world in which Cuddy et al. found an opposite effect and wrote all about the Power Pose, except that the Power Pose would be described not as an expansive posture but as coiled strength. We’d be hearing about how our best role model is not cartoon Wonder Woman but rather the Lean In of the modern corporate world. Etc. And, the funny thing is, that might be good advice too! As they say in chess, it’s important to have a plan. It’s not good to have no plan. It’s better to have some plan, any plan, especially if you’re willing to adapt that plan in light of events. So it could well be that either of these power pose books—Cuddy’s actual book, or the alternative book, giving the exact opposite posture advice, which might have been written had the data in the Cuddy, Carney, and Yap paper come out different—could be useful to readers.

So I want to separate three issues: (1) the general scientific claim that some manipulation of posture will have some effects, (2) the specific claim that the particular poses recommended by Cuddy et al. will have the specific effects claimed in their paper, and (3) possible social benefits from Cuddy’s Ted talk and book. Claim (1) is uncontroversial, claim (2) is suspect (both from the failed replication and from consideration of statistical noise in the original study), and item (3) is a different issue entirely, which is why I wouldn’t want to argue with claims that the talk and the book have helped people.

P.P.S. You might also want to take a look at this post by Uri Simonsohn who goes into detail on a different example of a published and much-cited result from psychology that did not replicate. Long story short: forking paths mean that it’s possible to get statistical significance from noise; they also mean that you can keep finding confirmation by doing new studies and postulating new interactions to explain whatever you find. When an independent replication fails, it doesn’t necessarily mean that the original study found something and the replication didn’t; it can mean that the original study was capitalizing on noise. Again, consider the time-reversal heuristic: Pretend that the unsuccessful replication came first, then ask what you would think if a new study happened to find a statistically significant interaction somewhere.

P.P.P.S. More here from Ranehill and Dreber. I don’t know if Cuddy would consider this as bullying. On one hand, it’s a blog comment, so it’s not like it has been subject to the stringent peer review of Psych Science, PPNAS, etc, ha ha; on the other hand, Ranehill and Dreber do point to some published work:

Finally, we would also like to raise another important point that is often overlooked in discussions of the reliability of Carney et al.’s results, and also absent in the current debate. This issue is raised in Stanton’s earlier commentary to Carney et al., published in the peer-reviewed journal Frontiers in Behavioral Neuroscience (available here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057631/). Apart from pointing out a few statistical issues with the original article, such as collapsing the hormonal analysis over gender, or not providing information on the use of contraceptives, Stanton (footnote 2) points out an inconsistency between the mean change in cortisol reported by Carney et al. in the text, and those displayed in Figure 3, depicting the study’s main hormonal results. Put succinctly, the reported hormone numbers in Carney, et al., “don’t add up.” Thus, it seems that not even the original article presents consistent evidence of the hormonal changes associated with power poses. To our knowledge, Carney, et al., have never provided an explanation for these inconsistencies in the published results.

From the standpoint of studying hormones and behavior, this is all interesting and potentially important. Or we can just think of this generically as some more forks in the path.

176 thoughts on “The time-reversal heuristic—a new way to think about a published finding that is followed up by a large, preregistered replication (in context of claims about power pose)”

  1. I think Cuddy’s “sickening” reaction to these kinds of criticism is the result of confusing Carney, Cuddy & Yap (2010), which refers to a set of empirical findings and theoretical claims, with Dana Carney, Amy Cuddy and Andy Yap, who (as far as I know) are nice people trying to do good science. The confusion sometimes happens for the critics, the criticized, and (especially) for readers of the critiques.

    Cuddy seems to take the criticisms of Carney et al. (2010) as a personal attack on her scientific integrity rather than an expression of skepticism about her findings and claims. Although in my experience critics are often careful to focus on findings and claims rather than individuals, readers of critiques seem to often come away with an opinion that the critic personally attacked the original authors. In this particular case, I think Cuddy is behaving like these readers. As far as I can tell, all the discussions (including Andrew’s) have been about the ideas and evidence rather than about the scientists. Like Andrew though, I do not know what is happening behind the scenes, where the criticisms (perhaps from readers of the critiques) may be much more personal.

    We would all do well to remember to focus on the scientific merit of the empirical findings and the theoretical claims rather than on whether Profs. X, Y, and Z are bad scientists (which is an inference that is difficult to make from a single set of studies). If they really are poor scientists, then a focus on their evidence and claims will reveal it without personal accusations. I like that in these kinds of posts, Andrew often includes a statement reflecting something like the above ideas (“I have no interest in getting into a fight with Amy Cuddy….”)

    • I disagree. I think that we most definitely need to call out bad scientists when they are doing bad science. These people often get tenured at the very top universities in the world on the merit of these bad studies. These same people may also create entire cottage industries around their initial results that we now know are bunk. So, yes, let’s take on the person as much as their science. There is a connection.

      • Also, let’s face reality here: she’s upset because she’s trying to make money off of this, and it turns out there’s no good evidence that it is a real effect.

  2. I like the time reversal heuristic as a way to think about statistical replication. I have found that the whole forking-paths concept is not “obvious” and it is often hard to convince people who are trained in the do-one-experiment-and-find-significance school of thought. With the rise of A/B testing (better called large-scale online testing), people with little statistical training are running tests: I find that there is strong resistance to what we think are basic concepts like not running tests till significance, multiple testing, etc. (See the simulation sketch after this comment.)

    What concerns me the most here are: (a) Cuddy’s implication that the replication of one of a set of metrics can be considered a successful replication, especially when the other metrics are not even close to replicating; (b) the idea that the replication needs to have precisely the same conditions as the original experiment – this implies that the finding is highly sensitive to original conditions, which does not give me confidence in the result.

    I wrote a little more about this here.
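    A minimal sketch of the run-tests-till-significance problem mentioned above (the batch size and number of looks are made up for illustration): peeking at a t-test after every batch and stopping at the first p < .05 inflates the false-positive rate well above the nominal 5% even when there is no true difference between arms.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def fp_rate_with_peeking(batch=50, n_batches=20, n_sims=2_000, alpha=0.05):
    """Share of null A/B tests declared 'significant' when the analyst
    tests after every batch and stops at the first p < alpha."""
    hits = 0
    for _ in range(n_sims):
        a = np.empty(0)
        b = np.empty(0)
        for _ in range(n_batches):
            a = np.concatenate([a, rng.normal(size=batch)])  # arm A, no true difference
            b = np.concatenate([b, rng.normal(size=batch)])  # arm B, no true difference
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits += 1
                break
    return hits / n_sims

# A nominal 5% test, but stopping at the first significant peek inflates
# the false-positive rate to several times that in this setup.
print(fp_rate_with_peeking())
```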

    • I would agree that a measure of replication _should_ be time invariant.

      One suggestion would be to measure how consistently the studies move a common prior in the same direction: for each study, take the direction in which that study alone moves the prior towards its posterior – that would be the direction of Posterior.i/Prior as a _vector_. With k findings those _vectors_ would be of dimension k, and a high measure of replication would require most of the k components to move in the same direction (a rough sketch follows this comment).

      The decision of when enough studies have been done to take the direction as being adequately determined (i.e. worth publishing a claim about) could be assessed separately.

      p.s. Kaiser: your shopping example seems good to me – I’ll steal it.
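      One possible way to operationalize this, as a rough sketch only (the conjugate normal model, the cosine-similarity summary, and the numbers are all assumptions added for illustration, not anything specified above):

```python
import numpy as np

def shift_vector(estimates, std_errors, prior_mean=0.0, prior_sd=1.0):
    """Posterior-mean shift for each of k outcomes under a conjugate normal model."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(std_errors, dtype=float) ** 2
    post_mean = (prior_mean / prior_sd**2 + est / var) / (1 / prior_sd**2 + 1 / var)
    return post_mean - prior_mean

def direction_agreement(shift_a, shift_b):
    """Cosine similarity of two studies' shift vectors: 1 = same direction, -1 = opposite."""
    return float(np.dot(shift_a, shift_b)
                 / (np.linalg.norm(shift_a) * np.linalg.norm(shift_b)))

# Made-up estimates and standard errors for k = 3 outcomes in two studies.
original = shift_vector([0.6, 0.4, 0.5], [0.25, 0.25, 0.25])
replication = shift_vector([0.1, -0.2, 0.0], [0.10, 0.10, 0.10])
print(direction_agreement(original, replication))
```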

  3. So we as social scientists have accumulated capital from our research successes. There is a bigger process unfolding (replication failures, misunderstood p-values etc.) eroding the valuation of some of that accumulated wealth. That there is so much over leveraged intellectual property in social science makes everyone uneasy. And it is understandably very upsetting for an individual when they suffer a devaluation. It just feels personal, as if you alone are taking the hit for what should be a broader market correction. And it’s also understandably frustrating that it’s unclear how to fight back against that devaluation.

    I guess I’ve often felt bad for new researchers because they will have a more difficult hill to climb. But I guess they’ll also have the benefit, if they are careful, of building less leveraged wealth. Hopefully they won’t have to give back big gains as often.

  4. Came across this sweet and sour treat late, so I’m a little behind.

    Sometime in the 80s and 90s, people went on about power ties. I remember sitting around a computer (not the AI) ‘lab’ at the Sloan School of whatever it is they do there at MIT (before we all had our own, personal, just to ourselves computers) listening to business students debate the color of the _paper clips_ on their job applications. (Yes, really.) Anyone seen the opening minutes of “American Psycho” with Christian Bale? The moments when Bale begins to seethe as the Wall Street wannabe tycoons compare size … I mean business cards? Business cards. These are things that business baboons worry about. They’d like us to believe they are geniuses at driving the economy. But they’re really more like peacocks posing. Or baboons presenting. (Google zoologists or primatologists on “presenting”.)

    So power poses. Big fat hairy deal. To somebody.

    The interesting question to me: What is the actual return on posturing? It suited Einstein and Picasso very well. Consummate self-promoters. But they had a few other things going for them, too. (Think of “consummate” verb vs. “consummate” adjective, and we have a sense of where the ‘we’re really only doing what our primate cousins do’ thinkers are headed.)

    Which of the founders said of George Washington “He was bound to be the first president. He was always the tallest man in the room”?

  5. Isn’t the quote “somehow people missed that they STILL replicated the effects on feelings of power” indicative of the entire problem? Was “feelings of power” the *one* thing that they claimed the power pose would improve? It sounds like the original author is hunting around the replication paper, looking for p < .05, and saying "Hey, I've got something!" If that's what they are doing with someone else's data, what should we think they did with their own?

    I would be concerned that even people in academia are mistaking something done in a university, with some kind of math, and published in a journal as being "real science," while ignoring the much more important hurdles of something done with an experimental set up that is capable of finding the evidence that one hopes to find, with a limited amount of flexibility for the experimenter to enhance noise into a "signal", and standing up to rigorous critiques.

  6. Just a few things.

    (1) It does seem kind of ridiculous to have people hold any pose other than “lounging on the couch” for six minutes, as Cuddy says (presumably correctly) was done in one of the so-called “replications.” I have sometimes been puzzled by the fact that many attempted replications make substantial changes from the protocol they’re supposedly replicating. I recognize that even if you do try to perform a perfect replication, that’s not going to happen: there will be lots of things that weren’t published in the original paper that you’ll have to guess at; both the experimenters and the subjects will be different from the original set, perhaps in important ways; you’ll be doing the replication in a different building in a different season and subjects will be in different moods, and all sorts of things like that. But with something like “hold the pose for N minutes,” if the claim is that there is an effect when N=2, why on earth would you do N=6? (Unless you were _also_ doing N=2).

    (2) It’s funny=amusing that Cuddy puts such faith in peer review — “The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach” — considering she’s presumably as familiar with the peer review sausage factory as I am. At times I have gotten thoughtful, challenging reviews that have made me do a lot of improving work on a paper, by reviewers who were obviously diligent and thorough enough to serve as the gatekeepers that the peer review process intends them to be. And at times I have been that kind of reviewer myself. But frequently I have gotten reviews that were clearly provided by reviewers who hadn’t read the paper carefully (or, perhaps, hadn’t read it at all). I’ve done some pretty cursory reviews myself; more than once I’ve provided a review with the caveat “I did not check all of the equations.” And nearly no reviewer actually checks the results of a statistical analysis…indeed, you usually can’t, because the data aren’t provided. Putting it all together, I don’t make a huge distinction between “peer reviewed” literature, the “gray” literature of published (but not peer-reviewed) papers and reports, and other sources of information such as blogs, code repositories, wikis, etc. This is not to say I don’t make any distinction at all: given no other information, I’d certainly trust a paper from a peer-reviewed journal over a blog entry on some blog I’ve never heard of. But I usually do have other information: I’m familiar with the authors, or their institution; the blog isn’t just some random blog I’ve never heard of, but one that I follow or that has been recommended to me; and so on. In short (too late!) I find it funny that someone would object to anyone getting information from a “non-peer-reviewed blog” by someone they know and respect. What’s more, I’m going to go out on a limb and suggest that this objection wouldn’t have come up if the “non-peer-reviewed blog” had _supported_ the findings rather than contradicting them.

    (3) Greg Francis’s point about taking criticism personally is very apt. I’ll extend it to say that it’s also easy to make things personal in one’s response. If Cuddy has indeed been “quiet and polite” in her response up to now — I haven’t followed this so I have no idea — then at least the “polite” part of that is laudable (I don’t see why “quiet” is a good thing but I presume she means it metaphorically, that she hasn’t been raising the emotional intensity of the conversation.) All I can say to this is: Cuddy, if you’re reading this, I encourage you to _continue_ being quiet and polite. Saying people are “bullying” you because they express skepticism about the magnitude of an effect that you claim, and give reasons for their skepticism…that’s impolite.

    (4) Andrew, I have a suggestion for future posts on the theme of “garden of forking paths” papers and other questioning of whether claims of substantial effect are justified: use the authors’ names only when citing the research, and don’t make a point of it. Your Slate article, for example, seems to be more about Cuddy than about the claimed Power Pose effect. Cuddy says this, Cuddy says that, Cuddy gave a Ted talk, etc., etc. WTF, man? Instead of “Amy Cuddy’s famous finding is the latest example of scientific overreach”, how about “Recently hyped findings are the latest example of scientific overreach”, or similar? Instead of “Consider the case of Amy Cuddy”, how about “Consider the case of the Power Pose”? Why make it personal?

    • Re: Points (3) and (4)

      Several years ago, I authored a piece that attacked an idea that had been put forth by several “experts in the field”. But, I wanted to avoid criticizing those individuals by name. I quoted several authors making such flawed assertions—in order to show that this idea had currency. But, I left the attribution in the footnotes and never put an author’s name in the text. It did not make sense to leave out the references—but I could at least make the reader work a little to discover the author.

      Similarly, I recently gave a conference presentation. My powerpoint displayed some deeply flawed assertions that I was rebutting. For most such assertions, I footnoted the source for each assertion on the slide where they appeared. But, my footnotes were in 8 point type and could not be read by a person with normal vision.

      So, if someone disputed the accuracy of my quotations, they could find the source and check them. But, the average member of the audience would not have been able to tie the assertions to a specific author.

      Now, given that “power posing” seems tied to one particular individual, it would be harder to employ this technique. Still, it might be worth trying.

      Bob

    • Phil:

      All good points (as I’d expect from a co-blogger).

      Regarding your point 1, I’ll ask around about the 6-minute thing and get back to you. Actually, I’m a pretty impatient person and I wouldn’t even want to sit for 2 minutes.

      Regarding your point 4, I agree in general, and as you perhaps recall, I talk about the himmicanes and hurricanes study, and the fat arms study, and the ovulation and voting study, without generally emphasizing their authors. Cuddy I’m not so sure of, because she, like Steven Levitt and Satoshi Kanazawa, is part of the story. In her Ted talk etc., she’s not just selling power pose, she’s selling Amy Cuddy, just as Freakonomics was not just selling Levitt’s research, it was selling Levitt’s persona. That’s fine, I have no problem with it, but then Cuddy is a big part of the power pose story.

      • Phil has some good points. I don’t agree with Point 1 though. So the power pose works if it is held for 2 min but not when it is between 3 and 4 min but works if it is 5.2 min but not 6 min, etc. We can ask the researchers why they chose 6 minutes but like I said above, if the effect is sensitive to things like that, I wonder if we are observing something real.

        • I wonder if we’re observing something real too, and I think if someone wants to try 2 minutes and 4 minutes and 6 minutes that’s fine. But to _just_ do 6 minutes, and not do 2 minutes or anything close to it…that’s not a replication.

          There was recently a story in the New Yorker about some oncologists who found an effective treatment for, um, it might have been non-Hodgkins lymphoma, I forget (but you can look it up) back in the 1960s. They reasoned that they needed to use several drugs in sequence to avoid selecting for resistant cells, and that they needed to cycle through them several times in succession at essentially the highest doses and shortest recovery periods the patients could tolerate. It might take years to get such a study approved today, but in those freer times they were able to just go ahead and try it, according to the article. So they did…and it worked! Complete cures in some patients, long-term remission in others…huge, huge effects. They published their findings and gave some talks. Some time later, one of them visited some famous hospital in New York and discovered they weren’t using the protocol, and what they were doing instead — the old standard treatment — was as ineffective as ever: their patients were dying. The oncologist asked “why aren’t you using our protocol?” and was told “we tried it and it didn’t work.” He couldn’t believe it so he asked for more information…and it turned out that the doctors who tried it didn’t like one of the four drugs, so they swapped it out for another one. And they didn’t want the patients to be so uncomfortable so they decreased the dosages and increased the recovery time between cycles.

          The Power Pose thing sounds like claptrap to me, but the way to test whether the claimed effect is real is to try the same thing Cuddy et al. tried, not to try something very very different. And from the standpoint of holding a pose, 6 minutes is very very different from 2 minutes. Perhaps more to the point, why NOT try 2 minutes? What is to be gained by doing a different time?

        • You should do it at 1 minute, 2 minutes, 3 minutes, and 5-6 minutes if you want to really do a proper experiment.

          2 minutes seems extremely arbitrary – and thus possibly cherrypicked. What if they had them hold poses for other lengths of time, and it didn’t work? That would suggest that the two minute number is arbitrary.

          Going for 1 minute and 3 minutes gives you an idea if plus minus 50% matters. 5-6 minutes would test if holding it for much longer gave a larger effect or negated it or whatever.

          If I saw an effect, I’d probably go for 15 and 30 seconds as well and see if there’s an effect for shorter time periods as well. See if there’s some minimum time.

  7. Your follow-up posts are the best!
    This one was timely for me—I clicked on a ‘Presence’ ad on npr.org by accident. Good thing I had been primed to be skeptical of the effect size!

  8. As impolite as this might be, perhaps Andrew should also look at the “preliminary research” that Cuddy cites in her recent New York Times op-ed (nytimes.com/2015/12/13/opinion/sunday/your-iphone-is-ruining-your-posture-and-your-mood.html):

    “How else might iHunching influence our feelings and behaviors? My colleague Maarten W. Bos and I have done preliminary research on this. We randomly assigned participants to interact for five minutes with one of four electronic devices that varied in size: a smartphone, a tablet, a laptop and a desktop computer. We then looked at how long subjects would wait to ask the experimenter whether they could leave, after the study had clearly concluded. We found that the size of the device significantly affected whether subjects felt comfortable seeking out the experimenter, suggesting that the slouchy, collapsed position we take when using our phones actually makes us less assertive — less likely to stand up for ourselves when the situation calls for it. In fact, there appears to be a linear relationship between the size of your device and the extent to which it affects you: the smaller the device, the more you must contract your body to use it, and the more shrunken and inward your posture, the more submissive you are likely to become. Ironically, while many of us spend hours every day using small mobile devices to increase our productivity and efficiency, interacting with these objects, even for short periods of time, might do just the opposite, reducing our assertiveness and undermining our productivity.”

    Hmmm. The study (see dash.harvard.edu/bitstream/handle/1/10646419/13-097.pdf?sequence=1 ) involved giving 75 people a bunch of tasks/surveys on an iPod Touch, an iPad, a MacBook, or a desktop (4 treatment groups). After the tasks/surveys were done, the experimenter would leave the room and promise to be back in 5 minutes, but wouldn’t, in fact, return. The question was how long the 18-20 people in each treatment group would wait to leave the room and find the experimenter. The finding was that the smaller the device, the longer people would wait to leave. Cuddy and her co-author take this as evidence that with smaller devices, people are more hunched over, and are therefore less assertive.

    Hmm. This finding could merely be because people liked or were more familiar with the smaller devices, or because they found smaller devices more distracting, or something else. It seems quite a leap to assume that a more hunched posture from smaller devices (note: posture wasn’t directly measured, as far as I can tell) made people more submissive.

    • ” In fact, there appears to be a linear relationship between the size of your device and the extent to which it affects you: the smaller the device, the more you must contract your body to use it, and the more shrunken and inward your posture, the more submissive you are likely to become.”

      Actually I have myself noticed this linear relationship between the size of my device and the extent to which it affects me, but not in an RCT.

    • It must be a real effect because it’s significant with less than 20 subjects per condition. Therefore, it’s very large.

      Further, one needn’t measure the actual hunching. That obviously must have happened. And please ignore our irrelevant gambling tasks.
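      The sarcasm above points at something quantifiable, sometimes called a Type M (magnitude) error: conditional on reaching p < .05, small-sample estimates overstate the true effect. A minimal sketch with made-up numbers (the true standardized effect of 0.2 and the 18 subjects per condition are assumptions, not figures from the study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def exaggeration_ratio(true_effect=0.2, n_per_group=18, n_sims=20_000, alpha=0.05):
    """Average |estimated difference| / true difference among the runs reaching p < alpha."""
    significant = []
    for _ in range(n_sims):
        a = rng.normal(loc=true_effect, size=n_per_group)  # treatment group
        b = rng.normal(loc=0.0, size=n_per_group)          # control group
        if stats.ttest_ind(a, b).pvalue < alpha:
            significant.append(abs(a.mean() - b.mean()))
    return np.mean(significant) / true_effect

# With a true standardized effect of 0.2 and 18 subjects per condition,
# the estimates that happen to reach significance overstate it several-fold.
print(exaggeration_ratio())
```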

    • Another theory: the tendency to leave a room varies in inverse proportion to how easy it is for someone to steal a device with which you’ve been entrusted. If I take a phone with me I’m worried about someone thinking I’m trying to abscond with it; if I leave it in an empty room it might disappear before I get back. Both dangers are smaller with a laptop.

      Not saying that explains the statistics; I do suggest my model is at least as plausible as Cuddy’s.

  9. There is always another caveat or confounder round the corner. If you really want to get rid of a significant result you will succeed.

    When you do empirical research you need to have a theory first. I scanned the replication study and it reads as a purely empirical exercise. I miss a discussion of why no effect was found while the theory of the original study says otherwise. You either inherit that original theory and own it or come up with your own theory. Expecting someone else is wrong is not a theory.

    Surely as a layman I think this power pose thing is fishy. But I am underwhelmed if this is the poster child for replication.

    • Martin:

      As Simmons and Simonsohn explain, no special theory is needed to explain a statistically significant comparison in the first experiment and no such pattern in the replication. These results are entirely consistent with zero effect and the garden of forking paths.

      This is not to say that the effect is zero, just that the results from an initial statistical significance in a non-preregistered study, followed by a null result in a replication, are consistent with a null effect. Hence no theory needed, except the quite reasonable theory that how you sit for a couple minutes will have minimal effects on hormone levels.

      • Don’t you take issue with this statement by Simmons and Simonsohn: “While the simplest explanation is that all studied effects are zero”?

        Since when are exact-zero effects the “simplest explanation”? A model with an exactly zero effect is highly specific and not “simple” at all by any defensible definition of simple. The statement confusingly reports the correct conclusion (“consistent with a zero-effect model”) for the sake of making an implicit normative claim (one should believe this model because it’s the simplest one).

        • Anon:

          I agree with Simmons and Simonsohn that the exact-zero story is the simplest. I agree with you that exactly zero effects are not plausible. That’s why I prefer to say something like, The effects appear to be too small and too variable to estimate in this way.

        • How about the following for the usual “compare two groups” type of study.

          Scenario A: Narrow interval that includes zero
          Scenario B: Wide interval that includes zero

          Given the methodology, data, and statistical model used:

          A) this evidence is inconsistent with theories predicting large positive or negative values, and the factors involved are expected to be relatively unimportant to any decision-making process.

          B) this evidence has little ability to distinguish between different theories, and it remains uncertain what influence the factors involved should have on any decision-making process.

    • What a strange interview.

      I find it quite surprising that actual scientists might read a paper that draws causal conclusions from observational data using only multiple regression and accept the conclusions without question. It is also surprising that someone like Nisbett talks about multiple regression as if the tool in itself or the observational data on which it is applied are the problem. He must be aware, I’m sure, of techniques such as propensity score matching, which allow causal inferences from observational data, sometimes using multiple regression.

      But the strangest point in his interview is that he seems to imply that causal inferences from experimental data are iron-clad and always informative. As if those frail results from social psychology, which replicate under this-and-that unknown circumstances and not under other, also unknown, circumstances, were somehow advancing our knowledge about cognitive phenomena and should not be questioned when they do not replicate. We have this body of hard-to-replicate effects that somehow points to how our behavior might be driven by unimportant cues, but this causal effect appears in some contexts and not others, and we can’t make any sense of it! Or we can search for “moderators” and claim to have found something new when p < 0.05 in every new experiment, adding more noise to the heap.

      It’s the worst kind of Lakatosian Defense for a theory that has so little in the bank, quoting Meehl. Social psychologists have a hunch-theory that apparently unimportant cues might affect behavior. It’s more a hunch than a theory, because we can’t really deduce any substantive hypothesis from it. Then, an experiment finds that some kind of priming has a positive effect on some behavior. Great, it seems that our hunch is not completely wrong! Then, a replication does not find the same effect. But wait, if we condition on this other covariate, we find a similar effect. Or we don’t, but it only means that contextual cues are so contextual that the effect varies a lot, and our “theory” is more strongly corroborated.

      • If a disgruntled researcher trusts his spidey sense over empirical evidence and the scientific method but no-one is there to read it, does it even make a sound? I can keep going. . .

        • Also, does it seem a bit creepy to anyone else that Andrew is blogging point by point about Cuddy’s post on her friend’s Facebook page? Because it does seem a little off-color to me.

        • James:

          I’m not actually on Facebook. As I wrote in my post above, someone emailed me the link. It doesn’t seem creepy at all for someone to respond to criticism. I don’t think it’s creepy for Cuddy to have responded to the criticism of Simmons and Simonsohn, I don’t think it’s creepy for Cuddy to have responded on Facebook to my criticism of her claims, and I don’t think it’s creepy for me to respond on a blog to Cuddy’s criticisms. Nor for that matter do I think it’s creepy that you’re commenting here.

          Criticism and give-and-take, that’s central to science. Indeed, I’m responding to you here because communication is important to me, so I like to clarify confusing points. When one commenter raises a question, it could be that many readers have a similar question, hence this effort.

          Finally, you can feel free to trust your “spidey sense” over the empirical evidence presented by Ranehill et al. and analyzed by Simmons and Simonsohn. I don’t have any spidey sense to bring to this particular problem so I’ll just have to go with the empirical evidence and the associated statistical arguments. (The statistics are necessary in this case to generalize from sample to population, as it is the population that is of interest.)

        • I take it as a sign of Andrew’s over-generous nature, and tendency to err on the side of encouraging opposing viewpoints, that the above comment was allowed through moderation. This kind of hateful trolling has no place in a reasoned discussion.

          (This comment doesn’t really progress any useful discussion either, so I will take no offense if it is excised in moderation.)

        • Andrew’s post doesn’t seem creepy or off-color to me in the slightest. I would characterize it as dead-on and pitch-perfect.

        • Where’s Amy Cuddy’s Facebook comment anyways? I cannot seem to find it at the link Andrew posted.

          Has she taken her response down?

        • Hey, if you don’t want others to read it, use email. Not FB posts.

          Besides, FB offers you a ton of ways to restrict who can see what you post.

          If Andrew can read a comment, its fair game to be allowed to respond.

  10. I have enjoyed the slate piece and this discussion very much.

    I wouldn’t be surprised if there weren’t a power pose effect at all (or if it were too fragile to be interesting). However, bringing in considerations about publication incentives, I have worries about how to interpret the epistemic import of one (1) non-replication. After all, independent replicators of popular findings can be expected to be biased for the null. I.e., if their baseline belief about the original finding was not skepticism, they wouldn’t even be attempting a replication study in the first place. Am I missing something?

    Glad to read that there will be a multi-site preregistered set of studies on the issue.

    • Assuming that Ranehill passed 9th grade science, she knows how to follow an existing method. Why then did they not follow the published method, if not in the hope of producing a non-replication? The incentive system certainly rewards non-replication over replication, so it surprises me that these methodological variations aren’t viewed with more suspicion. It seems to me that much of this blog-land analysis is more emotionally driven than logically driven. Perhaps that’s why none of this has passed the peer review standard yet.

        • It would indeed. Why do you think they didn’t follow the method in the first place? Indeed they are getting more attention for non-replicating than they would if they had replicated. There’s been a lot of fussing and statistical trickery and nobody has even attempted to replicate.

        • This failure to perform an actual direct replication happens all the time. It is bizarre from a scientific perspective. I’m not sure there are more rewards for a “non-replication”; rather, it gives everyone an out and a need for more funding. “Oh it must have been factor x”.

          “It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.”
          http://cerco.ups-tlse.fr/pdf0609/Meehl_1967.pdf

    • You are indeed missing something: Ranehill et al. were looking to “replicate and extend” the findings, i.e. exploring moderators of power pose effects, while replicating the main effects along the way.

      • “exploring moderators of power pose effects, while replicating the main effects along the way.”

        But no, you can’t explore duration as a moderator of the power pose effect when you use only one duration. Had they used 2 and 6 minutes, and estimated the difference in effect, that would be a moderation analysis. Using only 6 minutes is really just a different experiment altogether. No moderation exploration. No replication either.

      • Do any of the esteemed professors on this blog see any problem with dismissing the results of one experiment on the basis of an experiment with a different manipulation? If not, then this is an even sadder situation than I thought.

        • There are two points. A Ranehill experiment that does not follow the method and is therefore not a replication. And a blog post that is not peer reviewed and that none of you actually understand. For now let’s focus on how an experiment with a different method is not able to say that the result of the first experiment is invalid.

  11. Thank you Andrew for a great post with many important teachings!

    >While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve)

    Could someone please explain how this “right-skewed p-curve” would be generated in principle, or please link to a basic reference? “P-curve” means a curve of p-values, right? (A small simulation sketch appears after this comment.)

    It sounds like this curve of p-values should be retrospective (“all studied effects”), but then how are multiple p-values generated? Via some kind of bootstrap process? Is there any standard terminology for any of the specifics here?

    Thanks in advance for any assistance.
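    For what it’s worth, here is one minimal way to see where a p-curve comes from (a simulation sketch, not the Simonsohn et al. p-curve procedure itself; all numbers are illustrative): collect the p-values of the statistically significant results across many studies and look at how they spread over (0, .05). Under a true null that distribution is roughly flat; under a real effect it is right-skewed, i.e. piled up toward zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def significant_pvalues(true_effect, n_per_group=20, n_studies=20_000, alpha=0.05):
    """p-values of the simulated studies that reached p < alpha (the raw material of a p-curve)."""
    pvals = []
    for _ in range(n_studies):
        a = rng.normal(loc=true_effect, size=n_per_group)
        b = rng.normal(loc=0.0, size=n_per_group)
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha:
            pvals.append(p)
    return np.array(pvals)

bins = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
for effect in (0.0, 0.5):
    p = significant_pvalues(effect)
    counts, _ = np.histogram(p, bins=bins)
    print(f"true effect {effect}: share of significant p-values per bin =",
          np.round(counts / len(p), 2))
# true effect 0.0 -> roughly equal shares in each bin (flat p-curve)
# true effect 0.5 -> shares decline from the lowest bin upward (right-skewed p-curve)
```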

  12. Off topic, but there’s already a ton of warning flags here screaming “Bad Science Ahead”:

    (1) Professor of Business Administration. (2) Author has a BA + MA + PhD all in Social Psychology. (3) Work published in “Psychological Science”. (4) No pre-registration done. (5) All co-authors are from a Business School.

    Admittedly, none of these factors individually are a clincher, but taken together my priors would be extremely skeptical that I’m going to encounter any good Science ahead.

    • With all due respect: I do not think it is fair nor necessary to make such a broad statement. We do not have any evidence on whether science from individuals with these characteristics is any better or worse than research in other disciplines. To make inferences on the quality of research in entire (very large and diverse) domains from problems with individual cases does not seem appropriate to me. With this comment, I do not want to make any judgement regarding the study which is discussed here, I just think it is best to focus on the characteristics of the paper.

      Disclaimer: I am at a business school, and I know many colleagues that share the characteristics described above that do serious and excellent research.

      • We all have our different priors I guess. I only stated mine.

        e.g. The editors of Psych. Sci. may not think it fair for them to become notorious for publishing non-replicable work.

        In any case, I wasn’t picking on one factor, but a combination. And the combination I outlined above, in my opinion, is one that would make me very leery.

  13. Cuddy seems to use salivary testosterone tests. Is this well correlated with blood-sampled testosterone? From what I know, doctors are somewhat skeptical about salivary tests of testosterone when using them for treatments.

    Here’s one abstract:

    In a series of studies, we identify several specific issues that can limit the value of integrating salivary testosterone in biosocial research. Salivary testosterone measurements can be substantially influenced during the process of sample collection, are susceptible to interference effects caused by the leakage of blood (plasma) into saliva, and are sensitive to storage conditions when samples have been archived.

    http://www.ncbi.nlm.nih.gov/pubmed/15288702

    Is there a danger of noise swamping the signal? She doesn’t test on-site. The samples are dispatched to a remote lab after being frozen for varying periods (“up to two weeks”).

  14. The line of work that Amy Cuddy is promoting is a nice example of the power of hype. It shows that you really can get very far with an idea that is (a) simple (it takes no intellectual effort or technical background knowledge to understand it), (b) easy to apply in real life (no real effort is needed). Just raise your hands up high and you’re done.

    If someone were to give a Ted talk saying that you have to put in years and years of dedicated effort to become good enough to be successful in a job interview in some high skill area, you would not get 30 million hits.

    Regarding Andrew’s point, about Cuddy not even being willing to consider that her hypothesis might be unsupported by data even in principle, maybe Andrew needs to send her a copy of Superforecasting. There, the author lists the characteristics you need to have in order to be a good forecaster, which is basically what a scientist is supposed to be. Ability to question one’s own beliefs is up there high on the list.

    • Shravan:

      Regarding your second paragraph: Yeah, you’re not kidding. I just googled *ted 10000 hours* and found a bunch of links to a Ted talk by Josh Kaufman who, according to one of the links, says it doesn’t take 10,000 hours to learn a new skill, it takes 20:

      The speaker, Josh Kaufman, author of The Personal MBA, explains that according to his research, the infamous “10,000 hours to learn anything” is in fact, untrue.

      It takes 10,000 hours to become an “expert in an ultra competitive field” but to go from “knowing nothing to being pretty good”, actually takes 20 hours. The equivalent of 45 minutes a day for a month. . . .

      • I have seen that 20 hours talk. I think he demonstrated his ability to learn the ukulele in 20 hours in that talk. I guess it all depends on how you define “learn”.

      • Though it is perhaps true that to go from “knowing nothing” to “being pretty good at not being caught by most as still knowing almost nothing” actually takes 20 hours.

        Maybe not even 20.

        Some of my past endeavors:

        Learn to operate and demonstrate the safe use of industrial sandblasting equipment just from the product documentation (almost got caught in my first demo by not previously experiencing the push back from the nozzle).

        Setup and test a compressed air station onsite at a gas plant in the Yukon (what was new was doing this on site without the help of the service manager – almost got caught never having flown before).

        My first statistical consult at U of T three weeks before starting Msc biostats program as other students and faculty were away on vacation (saved by the very concise Analysis of Cross-Classified Categorical Data by Stephen Fienberg).

      • Shravan, Keith:

        Indeed, I’m not saying Kaufman is wrong. I’d love to be able to play the ukulele just a little bit, so maybe I’ll give it a try. I was just supporting Shravan’s earlier point that simple-and-easy ideas are popular!

  15. To all the commenters focusing on the differences between the two experiments:

    1. The Cuddy et al. experiment differs from the Ranehill et al. experiment by being a shorter treatment whose purpose is hidden from the participants. It’s possible there’s a large effect under the Cuddy et al. treatment and a small or zero effect under the Ranehill et al. treatment, but to make that claim you’re really threading the needle, and at the very least this calls into question the general claims made for power pose in practice.

    2. Don’t forget the time-reversal heuristic which is the topic of the above post. If you start with the null finding of Ranehill et al., then follow up with a smaller study with a weaker treatment that is not blind to the experimenter and, as usual, is subject to the garden of forking paths: this does not add up to strong evidence in favor of there being a large and consistent effect. My point is that in the usual discussion the power pose effect benefits from a sort of incumbency advantage: this becomes clearer once you take that away via the time-reversal heuristic.

    3. Finally, if you go back to my Slate article and my post here the other day: My point of writing all this was not to shoot down the claims of strong evidence for power pose—Simmons and Simonsohn did that just fine already, and indeed I reported on the Ranehill et al. experiment several months ago—but rather to point out the disconnect between the outsiders who see the Ted talk or the Harvard Business School press release or the airport book or the p less than .05 and think it’s all proven by science, and the insiders who have a much more skeptical take. The insiders aren’t always right, but in any case this difference in perspective is interesting to me. And I think that at some point even NPR and Ted will catch on, and realize that if they blindly promote this sort of study, they’ll be on the outside looking in.

    • If the Cuddy “power poses” advice really works, it’d be interesting to try to get people to do it without them knowing why they are doing it. Does she have any thoughts on this, I wonder.

      Apparently if the subjects know, the trick doesn’t work, eh?

      • I suspect it helps in the sense that it probably helps the candidate do a bit better and look and feel more confident. From a practical perspective, the question for me is, how much does it help relative to other factors? Competence, for example. Probably not much. If I am going to spend my time preparing for the interview of my lifetime, I would probably be better off spending my lifetime preparing for the interview, rather than admiring my underarms in the mirror for the recommended number of seconds.

    • The effect is totally useless if it only manifests itself if you don’t know what is going on. Ranehill’s study has superior methodology given Cuddy’s treatment of it as a self-help method.

      That said, I am interested in why they didn’t just replicate the method exactly.

  16. “Don’t forget the time-reversal heuristic which is the topic of the above post.”

    The time-reversal heuristic is a clever way to think about such issues. I’ll try to keep it in mind.

    Of course, a Bayesian analysis gives us
    P(X | both experiments) ∝ Prior(X) * Likelihood(X | first experiment) * Likelihood(X | second experiment).

    Order of the experiments does not matter—the product of the likelihoods is the same. It seems to me that the time-reversal heuristic is a good way to remind ourselves that the first experiment does not deserve extra weight just because it was first.
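
    To make that concrete, here is a toy R sketch using a conjugate beta-binomial model; the counts are made up purely for illustration:

    prior <- c(a = 1, b = 1)                    # Beta(1, 1) prior
    exp1  <- c(12, 30)                          # hypothetical successes/failures, experiment 1
    exp2  <- c(25, 17)                          # hypothetical successes/failures, experiment 2
    update_beta <- function(ab, dat) ab + dat   # conjugate update: add counts to (a, b)
    update_beta(update_beta(prior, exp1), exp2) # first experiment, then second
    update_beta(update_beta(prior, exp2), exp1) # second, then first: identical posterior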

    Bob.

  17. Hi everyone,

    We (Ranehill et al.) started our study with the aim of replicating AND extending the results of Carney, Cuddy and Yap (we have three behavioral tasks in our study, not one as in Carney et al.). Accordingly, we made a few changes to the experimental protocol that seemed, for our purposes, improvements to the initial study. We have therefore always acknowledged our study as a conceptual replication, not an exact replication.

    Our online supplemental material gives full information on all deviations between our design and those of the original study (downloadable here http://pss.sagepub.com/content/26/5/653/suppl/DC1). We also wrote more about the differences between our study and Carney et al. in the post on datacolada (http://datacolada.org/wp-content/uploads/2015/05/Data-Colada-reply-v3.pdf) where we discuss the likelihood of potential moderators suggested by Carney et al for explaining the difference in results.

    With respect to the time spent in the poses, we had participants hold the poses for 3 min each, instead of 1 min each as in Carney et al. We asked the participants whether they found the poses uncomfortable or not, and in the supplementary material of the paper we analyzed the data separately for those who reported the poses to be comfortable; this does not change our results. In particular, we write the following in the supplementary material:

    “A referee also pointed out that the prolonged posing time could cause participants to be uncomfortable, and this may counteract the effect of power posing. We therefore reanalyzed our data using responses to a post-experiment questionnaire completed by 159 participants. The questionnaire asked participants to rate the degree of comfort they experienced while holding the positions on a four-point scale from “not at all” (1) to “very” (4) comfortable. The responses tended toward the middle of the scale and did not differ by High- or Low-power condition (average responses were 2.38 for the participants in the Low-power condition and 2.35 for the participants in the High-power condition; mean difference = -0.025, CI(-0.272, 0.221); t(159) = -0.204, p = 0.839; Cohen’s d = -0.032). We reran our main analysis, excluding those participants who were “not at all” comfortable (1) and also excluding those who were “not at all” (1) or “somewhat” comfortable (2). Neither sample restriction changes the results in a substantive way (Excluding participants who reported a score of 1 gives Risk (Gain): Mean difference = -.033, CI (-.100, 0.034); t(136) = -0.973, p = 0.333; Cohen’s d = -0.166; Testosterone Change: Mean difference = -4.728, CI(-11.229, 1.773); t(134) = -1.438, p = .153; Cohen’s d = -0.247; Cortisol: Mean difference = -0.024, CI (-.088, 0.040); t(134) = -0.737, p = 0.463; Cohen’s d = -0.126. Excluding participants who reported a score of 1 or 2 gives Risk (Gain): Mean difference = -.105, CI (-0.332, 0.122); t(68) = -0.922, p = 0.360; Cohen’s d = -0.222; Testosterone Change: Mean difference = -5.503, CI(-16.536, 5.530); t(66) = -0.996, p = .323; Cohen’s d = -0.243; Cortisol: Mean difference = -0.045, CI (-0.144, 0.053); t(66) = -0.921, p = 0.360; Cohen’s d = -0.225). Thus, including only those participants who report having been “quite comfortable” (3), or “very comfortable” (4) does not change our results.”

    We believe that what is likely the most important departure in our study from Carney, et al., is that we employed an experimenter-blind design, in which the researcher conducting the behavioral (risk and competitiveness) tasks did not know whether the participant had been assigned to an earlier high- or low-power pose. Of all the differences between the two designs, we believe this one seems the most likely to play a key role in explaining the difference in results.

    Finally, we would also like to raise another important point that is often overlooked in discussions of the reliability of Carney et al.’s results, and also absent in the current debate. This issue is raised in Stanton’s earlier commentary to Carney et al., published in the peer-reviewed journal Frontiers in Behavioral Neuroscience (available here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057631/). Apart from pointing out a few statistical issues with the original article, such as collapsing the hormonal analysis over gender, or not providing information on the use of contraceptives, Stanton (footnote 2) points out an inconsistency between the mean change in cortisol reported by Carney et al. in the text, and those displayed in Figure 3, depicting the study’s main hormonal results. Put succinctly, the reported hormone numbers in Carney, et al., “don’t add up.” Thus, it seems that not even the original article presents consistent evidence of the hormonal changes associated with power poses. To our knowledge, Carney, et al., have never provided an explanation for these inconsistencies in the published results.

    Thanks everyone for a great discussion, and thanks Andrew for giving this so much attention.

    Eva Ranehill and Anna Dreber

    • Just curious, how much did it cost to run your replication? Would it be feasible for you to run another variant where you replicate Cuddy’s design *exactly*? E.g., Cuddy also seems to question the fact that in your replication the subjects knew what the goal of the study was.

      Not that I think Cuddy’s effect is real, but if that’s what it’d take for everyone to be satisfied, it might be worth it?

      Especially given how much impact her hypothesis has had, surely someone must be interested in funding an exact replication? And given your one replication already, you seem best placed to run another!

      Just my 2 cents.

      • Rahul:

        As noted in my post, Comprehensive Results in Social Psychology, “the preregistration-only journal for social psychology,” will be having a special issue devoted to replications of power pose. So you will get your wish, or something like it.

        • Rahul:

          Depending on the content of the coming special issue we are considering running a direct replication.

          The cost for our study was about USD 15,000, but the real cost is time, since participants take part in the study one at a time.

        • A different question:

          Is there previous literature on how reliable salivary testosterone measurements are? I assume a blood serum measurement is the gold standard.

          One suggestion: If you run another replication, do you think it would be feasible to do a blood draw and serum testosterone assay too?

          I’m wondering if what they are chasing based on the salivary measurements is mostly noise?

  18. I’ve always been struck by the low sample size of the original Carney et al. paper. Who sets out planning a study on a phenomenon that they must have expected to have a weak effect and decides that an N of 42 would be appropriate? If one of my graduate students came to me with that research plan we would have a very long and uncomfortable chat.

    It is also striking how very close to the .05 threshold some of the implied p-values are. For example, for the task where the participants got the opportunity to gamble, the reported chi-square is 3.86, which has an associated p-value of .04945.

    Of course, this reported chi-square value does not seem to match the data because it appears from what is written on page 4 of the Carney et al. paper that 22 participants were in the high power-pose condition (19 took the gamble, 3 did not) while 20 were in the low power-pose condition (12 took the gamble, 8 did not). The chi-square associated with a 2 x 2 contingency table with this data is 3.7667 and not 3.86 as reported in the paper. The associated p-value is .052 – not <.05.
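
    For what it’s worth, the same numbers fall out of a quick R check (a sketch that takes the cell counts above at face value; correct = FALSE turns off the Yates continuity correction, so this is the plain Pearson chi-square):

    m <- matrix(c(19, 3, 12, 8), nrow = 2, byrow = TRUE)  # rows: high/low power; columns: took/declined the gamble
    chisq.test(m, correct = FALSE)                        # X-squared is about 3.77, p about 0.052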

    Am I missing something??

    • Just realized (duh!) that this would undermine one of their central claims. I probably flubbed a simple chi-square calculation so will probably eat crow in a few minutes.

      • How would the difference between p=.04945 and p=.052 change anything about our inferences?

        a) p=.05 is not magic, and being one side or the other should not change any (sane) person’s conclusions.

        b) the world is still a world where small manipulations of individuals produce small and varying (in sign and magnitude) effects across different people, and it isn’t crazy to think that power poses will affect people in various ways

        c) it would still be a poorly designed, under-powered, noise-chasing type of “experiment” that doesn’t teach us much of interest about human beings or the technology of self-improvement

        • I agree with you completely that the difference is meaningless, but I doubt that the editors and many of the reviewers of the paper would see it the same way. The perception (accurate or not) that you have to get your p-value below this stupid threshold in order to have something meaningful is a belief that the authors appear to hold. A misreporting of this kind also suggests the possibility of some fudging of the results, which should be distasteful to anyone.

        • I get the same chi-square numbers as you. Not good.

          Also, check out this ANOVA:
          >>Finally, high-power posers reported feeling significantly more “powerful” and “in charge” (M = 2.57, SD = 0.81)
          >>than low-power posers did (M = 1.83, SD = 0.81), F(1, 41) = 9.53, p < .01; r = .44.

          If you make the most optimistic possible unrounded figures for this (Mhigh=2.574999, Mlow=1.825, both SDs=0.805), you get F=9.0933, not F=9.53. (You can also do a t-test and square the t value.) The p value is still significant and <.01, but if this result can be confirmed, it’s still sloppy reporting. (And it seems to me that the DFs should be (1,40) and not (1,41).)
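
          In R, with group sizes of 22 and 20 (the split reported above for the gambling task; assuming the same split applies to this rating), that arithmetic looks like this:

          m1 <- 2.574999; m2 <- 1.825; s <- 0.805    # most optimistic unrounded means, common SD
          n1 <- 22; n2 <- 20                         # assumed group sizes
          t  <- (m1 - m2) / (s * sqrt(1/n1 + 1/n2))  # two-sample t with equal SDs
          t^2                                        # F = t^2 for a two-group comparison: about 9.09, not 9.53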

          Oh, and while we're on the subject of sloppy reporting, have a look at 10.1111/j.1540-4560.2005.00405.x and calculate the t statistics and associated p values.

        • Nick:

          Good catch. I went to the relevant section of the paper, the Results section, and I see this:

          We created a composite score of warmth by averaging the three warmth items, α = .81. A one-way ANOVA revealed the predicted main effect on this score, F(2, 52) = 3.93, p <.03, such that participants rated the high-incompetence elderly person as warmer (M = 7.47, SD = .73) than the low-incompetence (M = 6.85, SD = 1.28) and control (M = 6.59, SD = .87) elderly targets. Paired comparisons supported these findings, that the high-incompetence elderly person was rated as warmer than both the low-incompetence and control elderly targets, t(35) = 5.03 and t(34) = 11.14, respectively, both ps <.01. In addition, reflecting the persistence of the stereotype of elderly people as incompetent, participants saw targets as equally (in)competent in all conditions, F(2, 52) = 1.32, n.s.

          I can’t quite follow what they’re doing here. It says N=55. It’s not clear to me whether it’s a between or within-person experiment, but given that they are presenting means and sd’s separately, I assume it’s between-person. But then don’t you need to know the N’s for each of the 3 conditions? If we assume N=18, 18, 19, then the correct t statistics are (7.47 – 6.85)/sqrt(.73^2/18 + 1.28^2/18) = 1.79 and (7.47 – 6.59)/sqrt(.73^2/18 + .87^2/19) = 3.34, respectively.
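
          In R, that arithmetic is (again assuming Ns of 18, 18, and 19 per condition):

          t_low  <- (7.47 - 6.85) / sqrt(0.73^2/18 + 1.28^2/18)  # high- vs low-incompetence: about 1.79
          t_ctrl <- (7.47 - 6.59) / sqrt(0.73^2/18 + 0.87^2/19)  # high-incompetence vs control: about 3.34
          c(t_low, t_ctrl)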

          Maybe there’s something we’re missing in how the data were analyzed? I’d guess that someone just slipped up when typing numbers into a calculator, but who knows? A t statistic of 11.14 should’ve rung some warning bells, that’s for sure!

          The thing that really worries me about this particular paper is the garden of forking paths. From the study design:

          This design allows us to explore how different levels of stereotype-consistent behavior might impact not only ratings of competence, but also ratings on the other key dimension, warmth.

          As noted above, there were three experimental conditions and then at least four outcomes:

          Participants then indicated how much George possessed each of a series of personality traits, on a scale from 1 (not at all) to 9 (very). We included one manipulation check (competence) and three items designed to assess warmth (warm, friendly, and good-natured).

          They then reported results on the average of the three warmth items and separately on the competence items.

          Lots of multiple comparisons problems here. Just considering what’s in the paper already: they report two F tests, one of which is significant. They report two t tests, one of which is significant. They come up with an elaborate explanation (“an oppositional stereotype enhancement effect”) based on a comparison that does not happen to be statistically significant (assuming that we calculated the t statistics correctly here) and then they take non-significant comparisons (with this tiny N=55 study) as indicating no effect.

          There are also various degrees of freedom that were sitting there, not reported in the paper and perhaps not even examined by the researchers, but which would’ve been natural to examine if nothing statistically significant had been found in the above comparison. There were 3 different warmth items, also whatever information they had on the participants in the experiment (perhaps sex or field of study). Even if all that was available were the 4 survey items, that’s still enough for lots of forking paths.

          Also this: “As always, additional studies are in order and should manipulate warmth and competence of other groups with evaluatively-mixed stereotypes, including those from the competent, cold cluster.” This is the unfortunately common step of extrapolating some general truth from whatever happened to turn up statistically significant (or, in this case, not statistically significant, given the calculation error in the t statistic) in some particular sample, and then planning to move forward, never checking to see if these exciting counterintuitive findings could be nothing but noise.

          And, get this: according to Google Scholar, that paper has been cited 470 times! Imre Lakatos is spinning in his grave. Paul Meehl too. They’re both spinning. I’m spinning too, and I’m still alive.

          How did you happen to notice this?

        • Andrew, you need to correct an error in your reanalysis of Amy Cuddy’s work. I first noticed the error in the talk you gave at Ohio State a couple of weeks ago. In that talk you noted how Cuddy and her colleagues (Michael Norton and Susan Fiske) had reported a miscalculated t-test. It is true they miscalculated the t-test (and their numbers are way off), but even though you correctly recalculated the t-tests they reported, those t-tests are not the most appropriate analysis. In this situation, with a three-group analysis of variance (which is the analysis they conducted), one should first demonstrate that the overall F-test is significant, which they did do properly. To compare the means, however, one should not simply do a t-test between them. As Fisher demonstrated about 80 years ago, the best test (unless there is some reason to suspect the variance differs across the three groups, and I can’t see any reason to suspect that here) is to compare the means with a follow-up F-test that uses the within-groups error term, pooled across all three groups, in the denominator. As Fisher showed, this procedure does not inflate experiment-wise error above alpha. Those F-tests, not the t-tests that Cuddy and her colleagues reported and which you recalculated, are the best test of the hypothesis, and the comparison that you noted as having t=1.79 and p = .08 might well be significant with the proper testing procedure that Fisher developed so long ago. As you no doubt know, the pooled variance, because it is based on including another third of the participants in the estimate of the error term, will on average across tests be more sensitive, and this is part of the reason that Fisher recommended this procedure in this type of situation.

        • Steve:

          Just to clarify, I would not say that I have ever performed a reanalysis of Cuddy’s work. All I did was recalculate a t-statistic, and that was just to check what Nick had sent me. Given all the researcher degrees of freedom in Cuddy’s paper, I think any reanalysis would have to start with the rawest of raw data and then consider all possible comparisons of interest. It would be a lot of work and, I think, not worth the effort.

          As has been said many times, there is a tradeoff between effort in design and effort in analysis.

        • Yes, but Andrew, you should acknowledge that the t-test is not the most appropriate analysis here, no? And further, your statement that has been repeated many times that the one comparison is not significant is potentially misleading. It may well be significant with the most appropriate analysis. At a minimum I think you should quit making that charge and correct the record by stating that whether that comparison is significant or not would require additional analysis.

        • Steve:

          I really don’t know what you’re talking about. You write, “your statement that has been repeated many times that the one comparison is not significant,” but I never said anything about a comparison being significant. What I said was that they reported t statistics of 5.03 and 11.14, but the correct calculations give 1.79 and 3.34. I never said that I recommended this analysis; I just reported the numbers. You say I should “quit making that charge and correct the record,” but there’s nothing for me to correct. This whole exchange is just weird. If you have a problem with miscalculations of t statistics, you should take it up with Cuddy, Norton, and Fiske—they’re the ones who lost control of their own data!

        • Hi Andrew,

          I guess I am reacting to your talk at OSU, in which you criticized Cuddy and colleagues for not responding to evidence, in particular this piece of evidence that the comparison for the t-test was not significant, and yet they still talk about their results as if the new analyses don’t change their interpretations. Your clear implication was that because that contrast wasn’t significant, they should change the way they talk about the study. You made a similar charge against Susan Fiske based on the same analysis when you were responding to her editorial in Perspectives on Psychological Science. What I am trying to point out is that, contrary to your claims both at the OSU talk and in your response to Fiske, we do not know without further analyses whether the conclusions they made in the paper should change or not. You suggested they should change their conclusions. I think that suggestion is way premature until the more appropriate analyses are done. It might well work out that they are right, and that when the analyses are done properly there is no reason to change their interpretations.

        • Steve:

          I really really really don’t care if a particular comparison in the paper by Cuddy et al. is “statistically significant” or not. I’m guessing Cuddy et al. did care about this, but I really don’t. As I said in my talk, I see the sloppiness in their data analysis, and the fact that they don’t reassess their conclusions when people point out their errors, as related to the larger point that they have these vague flexible hypotheses and vague flexible data analysis strategies that allow them to claim success from just about any data.

          You write, “Your clear implication was that because that contrast wasn’t significant, they should change the way they talk about the study. You made a similar charge against Susan Fiske based on the same analysis when you were responding to her editorial in Perspectives on Psychological Science.” I’m sorry, but I never said that. I don’t think there’s anything special about statistical significance.

          You’ll just have to go with what I say and what I write, not on “clear implications” that are coming from you, not me.

        • Andrew:
          It is clear that you really really really don’t care about this comparison, and in fact that you care so little about it that you can’t even be bothered to do the most appropriate analysis or even consider it. From where I sit it looks like Cuddy, Norton, and Fiske botched the analysis, but it looks like you botched the analysis too and then made a big deal about them botching the analysis and not responding to your botched analysis. That might make Cuddy, Norton, and Fiske sloppy in their initial analysis, but to me it looks like you can’t even be bothered to get it right before you expect people to respond to your analysis. Yes, Cuddy and colleagues shouldn’t have botched the analysis, but neither should you have, and it seems a bit hypocritical to chastise other people for botching an analysis that you then botch as well, and then further criticize them for not revising their interpretation based on your botched analysis.

        • Steve:

          OK, now we’re getting somewhere. You write that I “botched the analysis.”

          So let’s be clear: I did not “botch the analysis.” I didn’t analyze their data at all. All I did was recalculate a couple of t-statistics, and I just did that to check a recalculation that someone, I think it was Nick Brown, had already done.

          If you can tell me what you think I actually botched, that would be a start. You first seemed to say that my error was in thinking that a t-test was “the most appropriate analysis,” but I never thought that (nor did I say it). Then you said that I “should acknowledge that the t-test is not the most appropriate analysis here.” I’m happy to acknowledge this, especially as I never made that claim! Then you said, “whether that comparison is significant or not would require additional analysis,” which is fine, I never claimed this one way or another. Then you criticized what you called “a clear implication” of mine, but again it was something I never actually said or wrote or thought.

          So, no, I don’t see any botched analysis of mine. All I see is a calculation which you yourself said was correct. I never performed a reanalysis of these data, nor did I ever claim to.

          As I said, this whole thing is kinda weird to me. Nick Brown pointed to an error in that paper, an error in which t statistics were reported as 5.03 and 11.14 but were actually something like 1.79 and 3.34. I did some calculations to confirm this (under some assumptions), and then in a talk I pointed this out. No botched analysis.

        • Andrew:

          The botched analysis in a three-group design is comparing the individual means with a t-test instead of an F-test in which the denominator is the variance pooled across the three groups. As I pointed out in my first post above, Fisher wrote on just this situation ages ago and argued effectively that the F-test with pooled variance is the most sensitive analysis and does not inflate experiment-wise error. Cuddy, Norton, & Fiske should not have been comparing the individual means with a t-test, and in any case they should have computed the t-test properly. In evaluating the analysis, Nick Brown and you, in my view, should have realized that the most appropriate way to analyze the data here was not to just recompute the t-test, but rather to recommend the analysis suggested by Fisher ages ago. In my view, it is botching the analysis to miss this pretty basic point. By recomputing the t-test without recognizing that it really isn’t the best way to analyze the data, you aren’t correcting, and in fact are reiterating, the botched-analysis principle that a t-test is the way to do the comparisons in this situation. So, yes, in my view you botched the analysis here by not recognizing that recomputing the t-test isn’t the right thing to do. Instead, one should be analyzing the data with an F-test with pooled variance. Yes, you computed the t-test correctly, but you missed (or at least didn’t point out, which is as bad in my book) that it was the wrong analysis to do. A proper analysis is not only doing the math right; it is, of course, doing the most appropriate analysis, and you never pointed out that the t-test that you recomputed wasn’t the most appropriate analysis.
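
          For concreteness, here is a minimal R sketch of the pooled-error follow-up comparison being described, using the means and SDs quoted earlier in the thread and the same assumed group sizes of 18, 18, and 19 (the actual per-condition Ns are not reported, so this is illustrative only, not a reanalysis):

          m  <- c(7.47, 6.85, 6.59)             # reported group means
          s  <- c(0.73, 1.28, 0.87)             # reported group SDs
          n  <- c(18, 18, 19)                   # assumed group sizes
          df <- sum(n) - 3                      # error degrees of freedom
          mse <- sum((n - 1) * s^2) / df        # pooled within-group variance
          t12 <- (m[1] - m[2]) / sqrt(mse * (1/n[1] + 1/n[2]))  # contrast using the pooled error term
          pf(t12^2, 1, df, lower.tail = FALSE)  # follow-up F (= t^2) on 1 and df degrees of freedom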

        • Steve:

          OK, now I get it. You are unhappy that, in my blog comment and in my talk, I didn’t ever point out that the t-test that I recomputed wasn’t the most appropriate analysis.

          I’m happy to point that out now.

          Also, you write, “one should be analyzing the data with an F-test with pooled variance.” I actually don’t think an F test is the most appropriate analysis, either. But if you do an F test in one of your papers, I won’t say you botched the analysis, I’ll just say that you did an analysis that I do not recommend. To get a sense of the sorts of analyses I prefer, you can take a look at my books on multilevel modeling and Bayesian statistics.

        • Steve:

          Maybe next time I discuss such an example, I will add, “Not that I’m endorsing their analysis; I’m just saying that, conditional on that analysis being done, the calculations were in error.” Certainly no harm in making this clear.

        • Andrew:

          Fair enough, but when we are talking about the difference between a Bayesian analysis and a frequentist analysis, we are talking about a very different approach to the analysis, whereas with the difference between a t-test and an F-test we are talking about the same approach. In the same way that, if an author doing a Bayesian analysis used a pretty suspect prior when a much more defensible prior was available, one would expect the Bayesian critic to challenge the prior, here, when the authors used a t-test and a more sensitive analysis with an F-test was available, I would expect you as a critic to recommend the more sensitive analysis, with the error term based on the larger sample. If you had taken a Bayesian approach in your critique I would have had no problem, but when you took a frequentist approach I would expect you to do the more appropriate frequentist analysis. I still think that not to do so was to botch the analysis.

          Further, to take the frequentist approach and then argue that you don’t think it matters whether the effect is statistically significant is misleading. You have to know that when you reported the one comparison as p = .08 and took a frequentist approach, many people would read that finding (rightly or wrongly) as saying that Cuddy, Norton, and Fiske did not find evidence for the claim they made, and needed to update their interpretation because of that change in significance. The truth, however, is that the more appropriate frequentist analysis may well lead to the same interpretation if you are taking a frequentist/null-hypothesis-testing approach. So, when Cuddy and Fiske say that the analysis error (the miscalculation you noted), when corrected, didn’t change the interpretation of their results, they may well have just been following the typical frequentist/null-hypothesis-testing approach, and any beef you have with their not updating their interpretation, which is a beef you certainly do seem to have, isn’t a beef with their failure to do statistics properly within their approach and their vague hypothesis, but really a beef with taking a frequentist approach.

          I don’t see their hypothesis as vague here at all; what I see is that they may well have been following a frequentist approach and coming to the proper conclusions that that approach would suggest. Your overall analysis that they aren’t adjusting their interpretation seems like quite a stretch. So, perhaps you can admit not only that the t-test you computed was not a very good way to approach the data, but also that your conclusion that they weren’t willing to adjust their interpretation based on this not-very-good reanalysis went too far and was a stretch.

        • Steve:

          No, I never “reported the one comparison as p = .08.” I never computed any p-values at all! You are perhaps mixing up something you heard in my talk with something you read somewhere else.

        • @ Steve

          I don’t take everything Andrew says or does as gospel, but your criticisms of Andrew in this instance don’t hold water for me — there is too much that sounds like you are claiming you can read his mind and too much that sounds like you are trying to hold him to criteria that (from my perspective) seem idiosyncratic to you.

    • That wasn’t an exact replication. They mention in the link you posted:

      The poses and procedure for collecting the saliva samples were identical to the original study. However, a facial emotion task was included in this study. Although the task did not change the amount of time between the pre- and post-saliva tests, it did increase the amount of time a pose was maintained from 2-3 minutes to 10-12 minutes.

  19. In the original TED talk that got so many hits, Cuddy removed the uncertainty estimates from the comparisons of means. The error bars were in the paper (Fig 3), but she went to the extra effort of removing the error bars in the talk. I guess she didn’t want to fluster her audience with uncertainty estimates.

    Also, when I had read the paper some time back, it was clear they had monkeyed around with subject removal and so on. I found it really odd that their key effect was reported as p<0.05 when, correctly rounded to two digits, the p-value would have been 0.05, not less than 0.05.

    They wrote:

    "As hypothesized, high-power poses caused an increase in testosterone compared with low-power poses, which caused a decrease in testosterone, F(1, 39) = 4.29, p < .05;…"

    R delivers a p-value of 0.045. If I round to two digits, it would be 0.05. It seems like a trivial detail, but it would have been so much less impressive to say p=0.05 (accompanied by a silent "phew!"). That's the difference between 30 million hits and nothing, folks.
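
    The check is a one-liner in R:

    pf(4.29, df1 = 1, df2 = 39, lower.tail = FALSE)   # about 0.045, i.e. 0.05 after rounding to two digits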

    I also find it really depressing that Cuddy describes herself as a "public figure" on her facebook page (where I went to look for her offending post about Andrew being a bully). When scientists start to see themselves as celebrities first, it's time to quit and move on. I sometimes feel that Entsophy's (or is it one of the Anonymous guys now?) aggressive and out-of-control ravings against professors have not been entirely without merit.

    Related:
    One of my students recently commented (maybe he got it from somewhere) that the key test of whether you understand the p-value is whether you are still using it. If you are using it, it is almost certain that you don't understand it. I want to use that quote in a journal article some day.

    • I think the rest of you Professors must share some of the guilt. In such cases, what’s the Association of Psychologists doing? Or the other Profs. at the University etc.? Surely we on the blog aren’t the only Profs. spotting the hype & errors?

      When you see one Prof. being repeatedly misinterpreted / hyped / misleading etc., why don’t Profs. speak up in organized ways to correct the wrong? Admittedly, blogs are stepping in to do some of this useful function.

      But your professional bodies, associations, departments & other official fora are totally voiceless & passive about protecting the integrity of the scientific message.

      • Agreed, Rahul.

        What can I do personally? I am not associated with psychology or funding bodies in any official capacity, and the only professional memberships I have are in statistics (there’s a reason for that), with the exception of the Math Psych society, which is not really psychology at all as we understand it on this blog. I can and do point out all the problems in reviews I do. And I also write papers in linguistics and psycholinguistics to try to correct what I consider to be wrong (and I may be wrong about that too!). So I do this on a much smaller scale than Andrew, and in a much less in-your-face way than the Slate piece (I can see why Cuddy might react so negatively; it feels so personal).

        I guess there are more powerful people out there who could actually move things away from the current state.

        But I do concede your point. Talk is cheap, and people (including me) just sit on the sidelines, occasionally look up from their RStudio session, criticize someone on a blog comment, and go right back to RStudio-ing.

        What would be an appropriate systemic solution though? Educating social psychologists about statistical theory and applications would be a start, I guess. Is there *any* social psychologist out there doing this right? I am assuming that Darwinian processes would render him/her jobless if they end up with a null result each time, so the field will only be populated by p-barely-below-threshold-by-hook-or-by-crook researchers? OTOH, maybe the people who attract Andrew’s attention are still in a minority and the field as a whole is doing OK (I mean relatively; look at medicine as a baseline, totally messed up). As usual, we need some data on this to assess if intervention is needed at all.

        • > the field will only be populated by p-barely-below-threshold-by-hook-or-by-crook researchers?
          I recently reviewed a published paper by a group of pre-tenured faculty (about 3 to 4 years into their first appointments) and a tenured faculty member who earlier, without them, had published one of the best methodological papers in their field.

          In the application paper I reviewed, the authors claimed to have used the methodology developed by the tenured author but had actually deviated from it, using the method NOT recommended in that earlier paper because it was not conservative enough. Using that not-recommended method got them a few p-values just under .05, whereas the recommended method would have yielded none.

          I was a bit surprised the tenured faculty member went along with this (being on that application paper) but maybe they did not notice, or thought no one else would notice (the journal didn’t) or maybe they felt they had to take the hit to help these junior faculty keep their jobs. My guess is it was the latter.

    • Shravan:

      After reading your comment I took a look at the paper more carefully and Wow, yes, lots of forking paths:

      To control for sex differences in testosterone, we used participant’s sex as a covariate in all analyses.

      A reasonable decision, except that in other papers in this literature this is not always done. Separation by sex alone gives several forking paths: you can analyze the data as is, you can run a regression with sex as a control variable and look at its coefficient, you can interact the treatment effect with sex, you can look at effects separately for men and for women, you can look at one hormone for men and the other for women, etc.

      All hormone analyses examined changes in hormones observed at Time 2, controlling for Time 1. Analyses with cortisol controlled for testosterone, and vice versa.

      I’m not clear here on what they did. When analyzing cortisol at time 2, which of the other variables (cortisol at time 1, testosterone at time 1, testosterone at time 2) did they control for? Different options would be possible. As usual, the point here is not whether the authors did multiple analyses and picked the best, it’s that they had flexibility in choosing their analyses in light of the data. Indeed, in their paper they never claim otherwise, and various different ways of analyzing these sorts of data are present in the literature, including in other papers by the same authors.

      But it is not clear what was actually done. In the above passage it seems pretty clear that all analyses of hormones were controlling for the other hormone measurements, but then the main results that follow don’t seem to have such controls; they’re just basic ANOVAs.

      Also this, in a footnote:

      Cortisol scores at both time points were sufficiently normally distributed, except for two outliers that were more than 3 standard deviations above the mean and were excluded; testosterone scores at both time points were sufficiently normally distributed, except for one outlier that was more than 3 standard deviations above the mean and was excluded.

      What??? Did they just say that they excluded data from 3 subjects?

      No wonder Simmons and Simonsohn wrote that these results were consistent with no effect. With a small sample you only have to do a little bit of selection and you can get impressive-looking results.

    • “When scientists start to see themselves as celebrities first, it’s time to quit and move on. I sometimes feel that Entsophy’s (or is it one of the Anonymous guys now?) aggressive and out-of-control ravings against professors have not been entirely without merit.”

      It’s important to always blame “The system” and “incentives”. It’s important to never put any blame, or responsibility, on the actual professors themselves. Wouldn’t it be great if mere mortals (i.e., non-academics) could all use this as an excuse for our behaviour!

      I am just wondering if Cuddy et al. have done any replications, or follow-up research, themselves. If I were to have given a TED talk about power-posing saying that people should tell this stuff to other people who “could benefit from it”, I would want to make sure that what I am telling people is correct. For me that would mean not doing a TED talk in the first place with so little “evidence”, but aside from that it would mean doing lots of replications and follow-up research on the topic. The only related follow-up research I could find was this one:

      Cuddy, Amy J.C., Caroline A. Wilmuth, Andy J. Yap, and Dana R. Carney. “Preparatory Power Posing Affects Nonverbal Presence and Job Interview Outcomes.” Journal of Applied Psychology 100, no. 4 (July 2015): 1286–1295.

      I also wonder what her file-drawer looks like…

      • This paper had a larger sample size (61). They also got p-values that were not on the border this time (on the nice side of the threshold).

        And from the cited paper:

        “The difference between high-power and low-power posers’ self-reported feelings of power (high-power: M = 2.47, SD = 0.93; low-power: M = 2.04, SD = 0.93) was marginally significant, F(1, 60)=3.258, p = .076.”

        Actually, why *shouldn’t* one consider a p of 0.05 or 0.07 or even 0.10 or 0.20 significant? It would make life much easier for publication, at least.

        • Alpha is the expected value of p for each field. If you can collect more and cleaner data, alpha gets small, as in particle physics, down to 3e-7. If your data are really messy and small, then alpha gets larger (small-scale ecology studies probably use something like alpha=0.1). In most cases a moderate balance is apparently struck at about alpha=0.05.

        • “The difference between high-power and low-power posers’ self-reported feelings of power (high-power: M = 2.47, SD = 0.93; low-power: M = 2.04, SD = 0.93) was marginally significant, F(1, 60)=3.258, p = .076 (d = 0.46, ηp2 = .053) (see Table 2).”

          Followed by
          “This finding is consistent with past research showing that power posing has a weak impact on self-reported feelings of power despite its stronger effects on cognitive and behavioral outcomes”

          Yes, because a p value of .076 is exactly the same thing as an indication of a weak effect.

        • Nick:

          Good catch. I’ve seen this mistake before but have never thought about it much. Now I’m thinking it’s part of a whole mistaken worldview, what Tversky and Kahneman called the belief in the law of small numbers.

          Also interesting to think about how Carney, Cuddy, and Yap would react to all the information in this thread. My guess is that their inclination would be to fight back or to just ignore us and hope we go away. But if they really care about the effects of posture on success, and if they really think their research is relevant to such questions, they should want to know what they got wrong, no?

          This is the part that still puzzles me: I can only assume that these researchers do think this topic is important and that their research is relevant. I have no reason to think that they’re hustlers, cackling all the way to the bank. And I assume they also think their data support their claims; I have no reason to think that they’re Marc Hausers, playing shell games with monkey videos. But still I can’t picture them engaging with any of this criticism, especially given their essentially empty responses to Ranehill et al. and to Simmons and Simonsohn.

          So how do Carney, Cuddy, and Yap see this? It’s harder to say for Carney and Yap but they did coauthor the article and the response to Ranehill et al. so I assume they’re with Cuddy on this one, just keeping a lower profile, which reduces both the upside and the downside of being associated with (a) this research and (b) worse, the tenacious defense of this research.

          I’m guessing that they think that everything where the p-value in their data was less than .05 (and everything where the p-value was more than .05 but where they mistakenly computed as being under that threshold) represents a large and true effect.

          And that when they computed the p-value as greater than .05, they believe that, as the experimenters, they get to choose whether this represents a real and small effect, or that it represents no effect.

          I’m guessing they also believe that all their conclusions are (a) interesting, surprising, and novel, and (b) perfectly consistent with their underlying theory. And that the particular analyses they published were the only analyses they performed on the data, and that had the data been different, their data-selection and data-analyses rules would’ve been the same.

          And I’m guessing they think that all of us—you, me, Simmons, Simonsohn, Ranehill, Dreber, etc.—have an ax to grind and that nothing we say is worth listening to.

          And, indeed, if you accept the (statistically unjustified and implausible) view that everything they saw in their sample reflects general patterns in the population—then all our criticisms are indeed irrelevant. If they already know the truth, then any statistical or methodological criticisms are meaningless technicalities, and any failed replications just indicate that the replicators don’t understand what they’re doing.

          For Carney, Cuddy, and Yap to move forward, they have to accept that they might be wrong; they need to accept the principle that the sample is not a perfect reflection of the population.

          In this case I think we’d all be better off had the concept of statistical significance (or other misusable techniques such as Bayesian posterior probabilities) never been invented. Cos then Carney, Cuddy, and Yap would have to face up directly to their assumption that the sample is like the population; they couldn’t hide behind a naive view of statistical significance as an excuse to dodge all criticism.

        • Andrew, are Bayesian posterior probabilities misusable because one can make the same kind of binary decisions as with p-values? Or is it something else entirely?

          About your other points: just remember that they probably don’t even know they did anything wrong in their analytical methodology. That’s probably why Cuddy called you a bully.

          This is how I started out doing statistics too, when I had to analyze data for the first time in 1999 or 2000 or so. I remember the feeling of puzzlement when statisticians start criticizing you. They might be in the same boat.

        • I have the strong suspicion that if p values were made illegal by an act of Congress tomorrow and journal articles had to report Bayes factors instead, within a very short time you would have some magic value (e.g., 10) that became the new p=.05. Maybe BFs are less easy for the lazy or clueless researcher to hack than p values, but I think we need to acknowledge that the sociology of the research and publishing process plays a larger role (and theoretical deficiencies a correspondingly smaller one) in the problems associated with p values than we might like to admit. Journals, especially in psychology, are going to continue to want nice, neat, yes/no decision criteria, and the news media that consume this stuff will want them even more. If they can’t start an article with “Scientists have discovered that…”, they won’t be interested. (We got a good look at the realities of the knowledge production process these last few days with the Fifth Quarter Fresh thing.)

        • The vast majority of academics realize their entire career amounts to gaming the system so that taxpayer dollars shield them from ever having to get anything right. Every graduate office I’ve ever seen has “how to” guides written by senior academics on how to mold their career to achieve exactly that.

          For example, in Economics they’ll tell you to find any unused novel data set so you can write a gimmicky paper that lands in a top journal. This has to be done by a certain point or you’re toast in Econ. Whether the paper uncovers a truth or not is irrelevant.

          The ones I feel sorry for, though, are the ones who drank the Kool-Aid on Frequentist statistics and genuinely believed they were learning a difficult and esoteric skill which allowed them to do “real” science. Now, when it’s too late for them, they’re finding out that not only was classical statistics a con, but everyone except them is wise to the con.

        • >The ones I feel sorry for, though, are the ones who drank the Kool-Aid on Frequentist statistics and genuinely believed they were learning a difficult and esoteric skill which allowed them to do “real” science.
          > not only was classical statistics a con, but everyone except them is wise to the con

          Objection: as I understand it, the issue is that classical statistics runs into problems with low-N, low-power studies. Careful design and adequate sampling in fact do allow for sound Frequentist inference and “real science”.

          Please someone correct me if I’ve got this wrong?

        • Insisting on high powered tests improves Frequentist procedures because it gets them closer to the Bayesian answer. But it doesn’t go all the way to being Bayesian. Consequently, it’s trivial to produce tests (by Neyman-Pearson’s definition of a test) which have high power and low alpha, but which yield ridiculously bad and obviously wrong results.

        • For some reason there’s this strong force pushing people to claim some version of “Frequentist isn’t wrong, it just hasn’t been applied with enough fanatical zeal”.

          This is so contrary to the past 100 years of statistical history it just blows my mind everyone keeps insisting (without evidence) it’s true.

        • >For some reason there’s this strong force pushing people to claim some version of “Frequentist isn’t wrong, it just hasn’t been applied with enough fanatical zeal”.

          What is this straw man you’re pursuing?! If I understand correctly, within the (bio)medical community, much of the historical, and current, *and let’s not forget, overall highly effective* conceptual infrastructure is based on Frequentist & Type I/II thinking (including “sensitivity” and “specificity”). So we’re talking NIH, FDA, MD/PhD researchers, and practicing + teaching MDs. This must comprise 10Ks of people within this country alone.

          If there were a significant “effect size” to your claim (“Frequentist is wrong vs Bayesian”), that were unconditionally applicable across the board, I daresay we would see medical science “in crisis”, along the lines that’s being exposed in the social sciences.

          -> Can you cite evidence that medical science is in statistical crisis, overall, and therefore dangerous?
          -> Per your claims, should we not be observing widespread scandal within oncology, for example? Pharma bankruptcies due to “wrong” Frequentist efficacy claims? Droves of cancer doctors being tried for malpractice and losing their licenses, due to Frequentist quackery?

          >This is so contrary to the past 100 years of statistical history it just blows my mind everyone keeps insisting (without evidence) it’s true.

          What do you mean *precisely* by this? Less heat, please, and more light?

        • >”If I understand correctly, within the (bio)medical community, much of the historical, and current, *and let’s not forget, overall highly effective* conceptual infrastructure is based on Frequentist & Type I/II thinking (including “sensitivity” and “specificity”). So we’re talking NIH, FDA, MD/PhD researchers, and practicing + teaching MDs. This must comprise 10Ks of people within this country alone.”

          I am not the same person as “Anonymous”, and would attribute this problem to NHST* rather than frequentism per se. But yes, this is an accurate description of the problem. Taken to its logical conclusion, we need to redo everything from at least ~1980, when the majority of those not trained to do NHST had retired/died. In certain areas we need to go back even earlier (depending on when NHST became prominent). So many people have wasted their lives confusing each other because of that practice; it is just unbelievable.

          Fisher even predicted it:
          “We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.”
          http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf

          *NHST: using the default strawman nil-null hypothesis not predicted by any theory, then interpreting rejection of it to mean your research hypothesis is true.

        • Anoneuoid:

          Yeah, that’s my point. “NHST” is the bad thing. P-values are one way to implement NHST, and Bayes factors are another. The problem to me is with the NHST, the idea of false positives and false negatives and all that crap. Not with exactly how it’s done.

        • Also wrt “I daresay we would see medical science “in crisis”, along the lines that’s being exposed in the social sciences.” There has been discussion on this recently (I’m sure I have missed some):

          Sept 2011: Bayer reports ~25% of published biomed research results are reproducible.
          http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html

          May 2012: Amgen reports ~10% cancer results are reproducible.
          http://www.nature.com/nature/journal/v483/n7391/full/483531a.html#ref4

          June 2015: A cancer reproducibility project was initiated, but after a while people started getting disenchanted and leaving because it is not even worth reproducing such flawed research.
          http://science.sciencemag.org/content/348/6242/1411

          June 2015: It is estimated that half of all US biomed research funds are wasted on reports that are not even reproducible in principle.
          http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165

          December 2015: Continuing info on the cancer reproducibility effort: A quarter of the initial studies needed to be dropped before replications because it was so difficult to get methodological information and materials.
          http://www.nature.com/news/cancer-reproducibility-project-scales-back-ambitions-1.18938

          Andrew wrote:
          >”Yeah, that’s my point. “NHST” is the bad thing. P-values are one way to implement NHST, and Bayes factors are another. The problem to me is with the NHST, the idea of false positives and false negatives and all that crap. Not with exactly how it’s done.”

          I agree.

        • “The vast majority of academics realize their entire career amounts to gaming the system so that taxpayer dollars shield them from ever having to get anything right.”

          Where is the data to back this claim up? I know a few people like that, who I am sure are just gaming the system as you describe, but I know a larger proportion who I know are not.

          “Every graduate office I’ve ever seen has “how to” guides written by senior academics on how to mold their career to achieve exactly that.”

          “in Economics they’ll tell you to find any unused novel data set so you can write a gimmicky paper that lands in a top journal. This has to be done by a certain point or you’re toast in Econ. Whether the paper uncovers a truth or not is irrelevant.”

          I have never seen anything like this in my life. Can you provide a scan or a link to any such claim anywhere in any academic department?

          These kinds of claims can only be seen as bogus without any hard evidence to back them up.

        • What academia desperately needs is a period of chaos. A genuine unstructured free for all.

          It’s a bit like annealing (or simulated annealing, to the machine learning people). By raising the temperature and making things more chaotic, there is a chance the system will cool back down to a better optimum.

  20. ” First, we *published* a response, in Psych Science, to the Ranehill et al conceptual (not direct) replication, which varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable (and note that even so, somehow people missed that they STILL replicated the effects on feelings of power). ”

    If I am not mistaken, Cuddy et al. do not seem to have a problem with holding the poses for 6 minutes in their other research:

    http://faculty.haas.berkeley.edu/dana_carney/pp_performance.pdf

    “Participants maintained the poses for a total of five to six minutes while preparing for the job interview speech”

    • Maybe I’m missing something, but why is Cuddy saying in the first place that the discrepancy was between a 2-minute pose and a 6-minute pose? One of the Ranehill et al. co-authors (upthread) characterized it this way: “With respect to the time spend in the poses we had participants hold the poses for 3 min each, instead of 1 min each as in Carney et al.”

      Has Cuddy multiplied the time by 2 for each study?

      (Note: the actual studies linked here describe the time as 1 minute and 3 minutes, respectively, not 2 minutes and 6 minutes: datacolada.org/2015/05/08/37-power-posing-reassessing-the-evidence-behind-the-most-popular-ted-talk/).

    • Brad:

      1. I think there are applications for which the concept of false positive and false negative, sensitivity and specificity, ROC curves, etc., are relevant. There are actual classification problems. I just don’t think scientific theories are typically best modeled in that way. I don’t think it makes sense to talk about the incumbency advantage or power pose being true or false, for example.

      2. Even when “false positive and false negative” don’t make sense, the larger concepts represented by sensitivity and specificity can still be relevant. I tried to capture some of this with type M and type S errors but that framework is a bit crude too. I think it’s a good idea for us to continue working to express mathematically the ideas of scientific hypothesis building and testing without getting trapped in a naive acceptance/rejection framework. This theorizing was fine in the 1940s—statistical researchers had to start somewhere—but we can move on now.
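      [Note: a minimal simulation sketch of the type M / type S idea, in Python; this is not Gelman and Carlin’s actual retrodesign code, and the true effect and standard error below are made-up numbers chosen to represent a noisy study.]

      import numpy as np
      from scipy.stats import norm

      def type_m_s(true_effect, se, alpha=0.05, n_sims=1_000_000, seed=0):
          """Simulate replications whose estimate ~ Normal(true_effect, se) and
          summarize what happens among the statistically significant ones."""
          rng = np.random.default_rng(seed)
          est = rng.normal(true_effect, se, n_sims)
          signif = np.abs(est / se) > norm.ppf(1 - alpha / 2)
          power = signif.mean()
          # Type S: among significant results, how often is the sign wrong?
          type_s = np.mean(np.sign(est[signif]) != np.sign(true_effect))
          # Type M (exaggeration ratio): mean |significant estimate| / |true effect|
          type_m = np.mean(np.abs(est[signif])) / abs(true_effect)
          return power, type_s, type_m

      # A small true effect measured noisily: power is low, a nontrivial share of
      # the significant estimates have the wrong sign, and the significant ones
      # overestimate the true effect many times over.
      print(type_m_s(true_effect=2.0, se=8.0))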

    • >”Surely there should already be an instructive, sensational court case?”

      SCOTUS ruled against NHST six years ago:
      “For the reasons just stated, the mere existence of reports of adverse events—which says nothing in and of itself about whether the drug is causing the adverse events—will not satisfy this standard. Something more is needed, but that something more is not limited to statistical significance and can come from “the source, content, and context of the reports,” supra, at 15. This contextual inquiry may reveal in some cases that reasonable investors would have viewed reports of adverse events as material even though the reports did not provide statistically significant evidence of a causal link.”
      http://www.supremecourt.gov/opinions/10pdf/09-1156.pdf

      There may be earlier or better examples. I suspect there is probably a long history of courts ruling NHST inadequate, but this is well outside my area of research expertise.

  21. Anoneuoid: thank you for taking time to research and provide links. I don’t feel the citations decisively support your very strong claims, but they’re very interesting in their own right.

    Regarding the Supreme Court case you mentioned, I didn’t get too far into it before finding the following:

    “Matrixx’s premise that statistical significance is the only reliable indication of causation is flawed. Both medical experts and the Food and Drug Administration rely on evidence other than statistically significant data to establish an inference of causation”

    I have so far not come across any Frequentist or NHST teachings that draw a link between significance and causation. In fact, the texts and teachers I work with have all stressed SCOTUS’ point precisely.

    Under high pressure, many people will choose to “Say something, anything!” Matrixx tried a junk-science approach and lost.

    • >”I have so far not come across any Frequentist or NHST teachings that draw a link between significance and causation.”

      I’m pretty sure that if we actually check, we will find this is common. Can you give some examples that are available online?

      Everyone doing medical research that uses NHST does this. I went and checked NEJM and picked the first paper I saw. Amazingly, here it is in the very first sentence of the very first paper I checked:
      “In previous analyses of BENEFIT, a phase 3 study, belatacept-based immunosuppression, as compared with cyclosporine-based immunosuppression, was associated with similar patient and graft survival and significantly improved renal function in kidney-transplant recipients.”
      http://www.nejm.org/doi/full/10.1056/NEJMoa1506027

      No, you cannot accept your favorite hypothesis that “blatacept-based immunosuppression with improve renal function” just because there was a statistically significant difference between that group and the controls. Even looking later in the paper they describe other reasons for this difference (drug administration orally at home vs IV at the doctor).

        • BENEFIT is apparently an acronym for that project, I did not add emphasis. There were also at least two typos in my post: “blatacept-based immunosuppression with improve renal function”. That should be “belatacept-based immunosuppression will improve renal function”. I’m not sure if that contributed to the confusion.

          The first sentence of the paper I quoted, along with the lack of investigation of other explanations for the “improved renal function”, illustrates my point. The authors have taken a significant difference between the two treatment groups to mean “immunosuppression due to belatacept treatment has caused improved renal function.” This is despite the fact that they are clearly aware of other differences between the groups.

          Also, I bothered to look a bit further. If you follow the citation regarding renal function assessment [1], you can find that the difference between the two groups is equal to the average difference between male/female or white/black (plus who knows what other subgroups). In fact, the estimate of renal function is determined in part by someone’s race/sex (e.g., take some serum measurement and multiply it by ~0.8 if the patient is female; a sketch of this kind of estimating equation appears at the end of this comment). So there are likely many reasons for such a deviation from the null hypothesis that need to be investigated.

          My point is, I did not need to look far to find someone rejecting a null hypothesis and accepting the research hypothesis. This paper is in no way exceptional. Where are these people learning it?

          [1] http://www.ncbi.nlm.nih.gov/pubmed/10075613
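          [Note: the citation in [1] appears to be the MDRD equation paper. Here is a rough Python sketch of that kind of estimating equation, using the commonly quoted 4-variable MDRD coefficients; treat the exact constants as approximate and check the primary source before relying on them. The point relevant to this thread is simply that fixed sex and race multipliers enter the “renal function” estimate directly.]

          def mdrd_egfr(serum_creatinine_mg_dl, age_years, female, black):
              """Estimated GFR (mL/min/1.73 m^2), 4-variable MDRD-style equation.

              Coefficients are the commonly quoted (approximate) ones: the estimate
              is multiplied by ~0.74 for women and ~1.21 for Black patients, so group
              membership shifts the estimated renal function by itself.
              """
              egfr = 186.0 * serum_creatinine_mg_dl ** -1.154 * age_years ** -0.203
              if female:
                  egfr *= 0.742
              if black:
                  egfr *= 1.212
              return egfr

          # Same serum creatinine and age, different sex: roughly a 25% difference
          # in estimated renal function.
          print(mdrd_egfr(1.0, 50, female=False, black=False))
          print(mdrd_egfr(1.0, 50, female=True, black=False))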

        • Thank you for clarifying and for your further investigation. I’m starting to understand where you’re coming from. Definitely agreed that researchers need to take statistical competence seriously!

          >I did not need to look far to find someone rejecting a null hypothesis and accepting the research hypothesis. This paper is in no way exceptional. Where are these people learning it?

          To drop a few names: Taleb, Kahneman, and Pinker have all written extensively on the theme that proper statistical thinking is highly non-intuitive for most people. Andrew Gelman touches on this as well, here and there in his blog. The widespread availability of cookie-cutter tools and tests makes it very easy to “cut to the chase”. To be generous, perhaps people don’t realize that they don’t understand? To be realistic, it’s the rare person who doesn’t cut corners *somewhere*.

          We might imagine a better system, modeled on the construction industry, in which research papers would require sign-off by a licensed statistician. This seems likely to be accepted practice sometime in the future ;)

        • >”We might imagine a better system, modeled on the construction industry, in which research papers would require sign-off by a licensed statistician.”

          The problem with the usual approach is that there is a many-to-one mapping of research hypotheses to the statistical “alternative hypothesis”, so rejection of the null hypothesis can only lead to affirming-the-consequent errors.

          This can be dealt with by instead having the statistical null hypothesis correspond to a precise prediction of the research hypothesis (it is obvious to anyone not trained in NHST that it should always have been this way). Then, when the research hypothesis/theory has survived a strong test (due to the precision and accuracy of the prediction), we tend to believe it has something going for it. So I don’t think your idea will work, since this is not actually a problem in the realm of statistics. See, e.g., Fig. 2 here:

          Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It. Paul E. Meehl. Psychological Inquiry. 1990, Vol. 1, No. 2, 108-141. http://www.psych.umn.edu/people/meehlp/WebNEW/PUBLICATIONS/147AppraisingAmending.pdf

        • Thank you for the interesting link.

          >..So I don’t think your idea will work since this is not actually a problem in the realm of statistics.

          Hmm, I thought we had agreed that there is actually a problem with non-statisticians misapplying statistics..? I predict this will continue to be an issue for some time.. ;)

        • >”Hmm, I thought we had agreed that there is actually a problem with non-statisticians misapplying statistics..? I predict this will continue to be an issue for some time.. ;)”

          The typical end user of statistics has no conception of a distribution and uses SEM error bars rather than SD because they are narrower (see the quick illustration at the end of this comment). They will have to take the time to educate themselves; there is nothing we can do besides ignore their analyses.

          Still, I doubt you can find a recent statistics textbook that gives an example of some form of “significance/hypothesis test” and not make the error I described above. I have seen Student make it and Neyman make it, but interestingly I have never seen Fisher fall prey to it.
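          [Note: a quick illustration of the SEM-vs-SD point, with made-up data: the standard error of the mean is just the SD divided by sqrt(n), so SEM bars are always narrower, which is exactly why they get picked for cosmetic rather than statistical reasons.]

          import numpy as np

          rng = np.random.default_rng(0)
          x = rng.normal(loc=10.0, scale=2.0, size=12)   # a typical small sample

          sd = x.std(ddof=1)            # sample SD: describes the spread of the data
          sem = sd / np.sqrt(len(x))    # SEM: always narrower, by a factor of sqrt(n)
          print(f"SD = {sd:.2f}, SEM = {sem:.2f}")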

  22. There’s further discussion of the Slate piece on the ISCON Facebook group, where Paul Coster accuses those arguing against hyping un-peer-reviewed work of hypocrisy for not critiquing the Gelman and Fung piece, since it discusses a blog post (apologies for my first post there, where I forgot that you wrote it and referred to the authors as “journalist”).

  23. Anoneuoid>I doubt you can find a recent statistics textbook that gives an example of some form of “significance/hypothesis test” and not make the error I described above.
    >..can be dealt with by instead having the statistical null hypothesis correspond to a precise prediction of the research hypothesis

    Bock et al., “Stats: Modeling the World”, 4e (2015) is an excellent AP-level HS text. From Chapter 19, “Testing Hypotheses About Proportions”, p. 497:

    “In statistical hypothesis testing, hypotheses are almost always about model parameters.. The null hypothesis specifies a particular parameter value to use in [the] model.. We write (Hsub0) : parameter = hypothesized value. (HsubA) contains the values of the parameter we consider plausible when we reject the null.”

    • I was unable to get access to the book. But what I would look at is how the testing procedure is actually applied in the example problems and what conclusions are drawn or asked of the student. Can you give some examples of these?

    • Not sure if these are official, but check out ppt 21 slide 5:
      -There is a temptation to state your claim as the null hypothesis.
      –However, you cannot prove a null hypothesis true.
      -So, it makes more sense to use what you want to show as the alternative.
      –This way, when you reject the null, you are left with what you want to show.
      https://mhsapstats.wikispaces.com/BVD+Powerpoints+and+Chapter+Notes

      Here we have an explicit encouragement to reject the null hypothesis and accept “your claim”, ie the student is taught to make the basic logical error of affirming the consequent.

      • I realized something really disturbing here. This stats teacher has learned that high school students find it “tempting” to test their actual research hypothesis. This is consistent with my personal experience (although after HS); I still remember being in their shoes and also finding it “tempting” to test my actual hypothesis rather than some other default hypothesis.

        As early as high school, students are thinking scientifically but are being specifically told to stop doing so during class. This must be happening all across Western civilization at earlier and earlier ages. I really didn’t believe that my estimate of the degree of damage being caused by NHST could get any worse, but I had never considered that the age at which people start learning this stuff is also becoming younger.

      • >But what I would look at is how the testing procedure is actually applied in the example problems and what conclusions are drawn or asked of the student. Can you give some examples of these?

        With respect, it sounds like you may be moving the goalposts here. I provided a specific example, which you doubted I could provide. All three of my text’s authors are award-winning educators. It seems a bit presumptuous to suggest (particularly from safe anonymity) that they actually don’t understand what they’re teaching, and are in fact misteaching statistics.

        >Here we have an explicit encouragement to reject the null hypothesis and accept “your claim”…

        I’m not sure what your criticism is? The “claim” is a well-defined algebraic statement involving a parameter value, as you insisted earlier.

        >..ie the student is taught to make the basic logical error of affirming the consequent.

        There is a sidebar on the page I referenced above, right next to the paragraph I quoted. In large font is printed Fisher’s classic quote, ending with:

        “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

        As far as I can tell, Bock et al’s logic directly follows Fisher’s. Now, are you claiming that this type of classic NHST inference is logically deficient? I’m having trouble finding authoritative discussion on the web in support of your thesis. Let’s please work out exactly what you mean. What do you think of the next paragraph, which I have formulated on my own, following the reference below? I believe it summarizes NHST inference, in the form of “If P, then Q”:

        If a precise null hypothesis is found extremely unlikely to be true, then an appropriately defined alternative hypothesis may be accepted as significantly more likely to be true, subject to the statistical power of the particular significance test used.

        Per reference below, “Affirming the consequent” follows the form:

        If P, then Q.
        Q.
        Therefore, P.

        Would you please demonstrate via simple example how an NHST deduction, following the lines of my conditional above, might fall prey to the fallacy?

        Reference: https://en.wikipedia.org/wiki/Affirming_the_consequent

        • >”With respect, it sounds like you may be moving the goalposts here.”

          Sorry for the miscommunication. I originally asked for ‘an example of some form of “significance/hypothesis test”’; I later asked again: “Can you give some examples of these [example problems and what conclusions are drawn or asked of the student]”? These are meant to ask for the same thing, i.e., an actual example of the testing procedure being applied. Since it is not (easily) possible for me to check this book myself, I asked for additional info so that my initial request could be met. There is no moving of the goalposts here.

          >”Bock et al’s logic directly follows Fisher’s…If a precise null hypothesis is found extremely unlikely to be true, then an appropriately defined alternative hypothesis may be accepted as significantly more likely to be true, subject to the statistical power of the particular significance test used.”

          1) The first issue with what you have written is only tangential to my point and I do not wish to focus further on it (since it has been a major distraction from the main problem), but the alternative hypothesis was not a part of Fisher’s logic. You appear to be working with a hybrid of Fisher and Neyman-Pearson. You can search “NHST hybrid” for much discussion on this, but for an introduction to this phenomenon see: Mindless statistics. Gerd Gigerenzer. The Journal of Socio-Economics 33 (2004) 587–606. http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf

          Also, here is Fisher on the alternative hypothesis and type II error:

          “It was only when the relation between a test of significance and its corresponding null hypothesis was confused with an acceptance procedure that it seemed suitable to distinguish errors in which the hypothesis is rejected wrongly from errors in which it is “accepted wrongly” as the phrase does. The frequency of the first class, relative to the frequency with which the hypothesis is true, is calculable, and therefore controllable simply from the specification of the null hypothesis. The frequency of the second kind must depend not only on the frequency with which rival hypotheses are in fact true, but also greatly on how closely they resemble the null hypothesis. Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis, and would never have come into consideration in the theory only of tests of significance, had the logic of such tests not been confused with that of acceptance procedures.”
          Ronald Fisher. Journal of the Royal Statistical Society. Series B (Methodological). Vol. 17, No. 1 (1955), pp. 69-78. http://www.phil.vt.edu/dmayo/PhilStatistics/Triad/Fisher%201955.pdf

          2) I linked above to the Meehl (1990) paper where this is explained in (very great) detail; I also gave a real-life example of it in action with the NEJM paper. The problem arises from the many-to-one mapping of research hypotheses to a vague statistical alternative hypothesis, which is why I said this issue is “not actually a problem in the realm of statistics”.

          Your example does not include any research hypothesis, only two statistical hypotheses. This scenario does not correspond to any actual use case I can think of. In practice it usually goes like this:
          -A) If P [the research hypothesis H is true] then Q [the parameter (difference between means) will be greater than zero].
          -B) Q [the null hypothesis that the parameter is equal to zero is unlikely, and the parameter was measured to be positive; therefore the parameter is likely to be greater than zero].
          -C) Therefore, P [the research hypothesis H is true].

          We can even forget that we have uncertainty about the value of the parameter (ie Omniscient Jones told us the value) and change step B to “Q [the parameter is greater than zero].” As we see, considering this limiting case where our uncertainty approaches zero does not fix the problem.

          Note that if you observe (where ~ = “not”) ~Q [the parameter is not greater than zero], it is valid to deduce ~P [the research hypothesis H is not true] in this simplified description. In reality, though, P is never so simple. The theory is never tested alone; there are also auxiliary considerations A (e.g., no malfunctioning equipment, etc.). So in practice ~P = (~H and/or ~A). In other words, the data or some other assumption can be wrong instead of, or in addition to, the research hypothesis.
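          [Note: for anyone who wants to see the two inference patterns side by side, here is a tiny self-contained Python check of the propositional forms under discussion (nothing statistical in it). Affirming the consequent has a counterexample; modus tollens, the ~Q-therefore-~P direction mentioned above, does not.]

          from itertools import product

          def implies(p, q):
              return (not p) or q

          rows = list(product([True, False], repeat=2))

          # Affirming the consequent: premises (P -> Q) and Q, conclusion P.
          ac_counterexamples = [(p, q) for p, q in rows if implies(p, q) and q and not p]

          # Modus tollens: premises (P -> Q) and not Q, conclusion not P.
          mt_counterexamples = [(p, q) for p, q in rows if implies(p, q) and not q and p]

          print(ac_counterexamples)  # [(False, True)]: premises true, conclusion false -> invalid
          print(mt_counterexamples)  # []: no counterexample -> valid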

        • Thanks for reminding me of this quote of Gerd’s

          “Who is to blame for the null ritual?
          Always someone else. A smart graduate student told me that he did not want problems with his thesis advisor. When he finally got his Ph.D. and a post-doc, his concern was to get a real job. Soon he was an assistant professor at a respected university, but he still felt he could not afford statistical thinking because he needed to publish quickly to get tenure. The editors required the ritual, he apologized, but after tenure, everything would be different and he would be a free man. Years later, he found himself tenured, but still in the same environment. And he had been asked to teach a statistics course, featuring the null ritual. He did. As long as the editors of the major journals punish statistical thinking, he concluded, nothing will change.”

        • Thank you A.! I sincerely appreciate you breaking this down for me :) Thanks also for clarifying your previous posts and pointing out that I’m thinking in hybrid NHST terms.

          I didn’t want to get further into the textbook before getting all this conceptual groundwork sorted out. Thank you for your patience with me on that. Here is a worked example from Bock et al, 4e (big thanks to ABBYY OCR! ;). It appears shortly after the above-quoted statements about the null hypothesis:

          (begin quote)
          Step-by-step example: Testing a hypothesis

          Advances in medical care such as prenatal ultrasound examination now make it possible to determine a child’s sex early in a pregnancy. There is a fear that in some cultures some parents may use this technology to select the sex of their children. A study from Punjab, India (E. E. Booth, M. Verma, and R. S. Beri, “Fetal Sex Determination in Infants in Punjab, India: Correlations and Implications,” BMJ 309 [12 November 1994]: 1259-1261), reports that, in 1993, in one hospital, 56.9% of the 550 live births that year were boys. It’s a medical fact that male babies are slightly more common than female babies. The study’s authors report a baseline for this region of 51.7% male live births.

          Question: Is there evidence that the proportion of male births is different for this hospital?

          Hypotheses:
          The null hypothesis makes the claim of no difference from the baseline.

          The parameter of interest, p, is the proportion of male births:
          Hsub0 : p = 0.517
          HsubA : p ≠ 0.517. The alternative hypothesis is two-sided.

          Model : Think about the assumptions and check the appropriate conditions. (content skipped for brevity)

          Mechanics : (content skipped for brevity)

          Conclusion : The P-value of 0.0146 says that if the true proportion of male babies were still at 51.7%, then an observed proportion as different as 56.9% male babies would occur at random only about 15 times in 1000. With a P-value this small, I reject Hsub0. This is strong evidence that the proportion of boys is not equal to the baseline for the region. It appears that the proportion of boys may be larger.

          State your conclusion in context : That’s clearly significant, but don’t jump to other conclusions. We can’t be sure how this deviation came about. For instance, we don’t know whether this hospital is typical, or whether the time period studied was selected at random. And we certainly can’t conclude that ultrasound played any role.
          (end quote; a quick numerical check of that P-value appears in the sketch at the end of this comment)

          Applying your H/P/Q logic, we would seem to have:

          H : The proportion of male live births may be significantly atypical at one particular hospital
          P : H is true
          Q : The appropriate significance-test P-value will be very low, indicating that the proportion for this hospital is unlikely to reflect the population value for all live births in the region.

          So the authors do infer from Q “strong evidence” that H is true. However, their H seems like a pretty conservative and “statistical” (to use your term) type of hypothesis, similar to what I proposed, which you said isn’t seen in practice. They also explicitly caution the student against inferring a broader “causal” H from Q.

          My vague memory is that at least some high-profile epidemiological research has concluded in similar fashion: “What causal H could be responsible for Q? Here are some ideas: H1, H2, H3, etc.” In words, the researchers are confident something very significant is going on, but don’t know why. Autism and bee colony collapse disorder are two headline news examples which may have this type of research behind them.
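          [Note: for what it’s worth, the arithmetic in the Bock et al. example quoted above checks out. A minimal Python sketch of the usual one-proportion z test; the normal approximation is an assumption about the “Mechanics” step that the quote skips.]

          from math import sqrt
          from scipy.stats import norm

          p0, p_hat, n = 0.517, 0.569, 550      # baseline, observed proportion, births

          se = sqrt(p0 * (1 - p0) / n)          # standard error under the null
          z = (p_hat - p0) / se
          p_value = 2 * norm.sf(abs(z))         # two-sided

          print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
          # Prints z of about 2.44 and a P-value of about 0.015, matching the book's
          # 0.0146 up to rounding of the inputs.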

        • > some high-profile epidemiological research
          If it is an epidemiological study, there is confounding and bias (or at least these can’t be ruled out), and this implies the distribution of p-values when the null is true is undefined, which makes nonsense of claims like “observed … as different as … would occur at random only about x times in 1000”.

          It is surprising how many, even those doing teaching and research, appear to be unaware of this; see http://www.stat.columbia.edu/~gelman/research/published/GelmanORourkeBiostatistics.pdf

          Now, the book you are quoting does seem to be at least (implicitly) pointing that problem out.

          Keith, are you aware of anyone who has measured the distribution of p-values that result from samples taken from supposedly the same population under very controlled settings with randomization? I think that could provide a lower bound on the types of deviation from uniform we would expect.
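          [Note: a minimal simulation sketch of the kind of check being asked about, in an idealized randomized setting rather than a real uniformity trial; the “bias” term below is an illustrative stand-in for the confounding mentioned above. With clean randomization and a true null the p-values come out roughly uniform; with even a modest unmodeled bias they pile up near zero.]

          import numpy as np
          from scipy.stats import ttest_ind

          rng = np.random.default_rng(1)

          def null_pvalues(bias=0.0, n=30, reps=20_000):
              """P-values from two-sample t tests on draws from the same population,
              with an optional systematic shift standing in for bias/confounding."""
              ps = []
              for _ in range(reps):
                  a = rng.normal(0.0, 1.0, n)
                  b = rng.normal(bias, 1.0, n)
                  ps.append(ttest_ind(a, b).pvalue)
              return np.array(ps)

          for bias in (0.0, 0.3):
              p = null_pvalues(bias)
              print(f"bias = {bias}: share of p < 0.05 is {np.mean(p < 0.05):.3f}")
          # With bias = 0, about 5% of p-values fall below 0.05 (roughly uniform);
          # with bias = 0.3, the share is several times larger.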

        • Anoneuoid:

          Fisher in the 1930’s with agricultural uniformity trials.

          Now, Peirce in 188? argued that randomization is what clearly justifies inference from a random sample of a population.

          More recently, Efron for gene expression studies.

          Epidemiological studies are a different kettle of fish.

        • Thanks for posting the example. Let’s start with this: “However, their H seems like a pretty conservative and “statistical” (to use your term) type of hypothesis, similar to what I proposed, which you said isn’t seen in practice.”

          1) First, I went and got the paper [1] and found the authors did not perform this analysis, so this is not an example of something found in practice. Second, I did not claim anything wasn’t “seen in practice”, rather that the “scenario does not correspond to any actual use case I can think of”. Indeed, the textbook acknowledges there is no actual use for this test: “That’s clearly significant, but don’t jump to other conclusions.”

          Exactly! ***The extent of the appropriate conclusion is limited to a statement about the statistical significance***. When properly interpreted, the outcome of a default-nil-null hypothesis significance test tells you nothing about the external reality or any theory of interest that may make useful predictions; the only “use” is to proceed to commit any of a number of logical fallacies.

          2) I also found that some of the information in this question appears to be incorrect. Specifically, the number of 550 births in 1993 appears to be made up, and the baseline was from the same hospital ~10 years earlier, not “for the region”. So in terms of real-life applicability, this analysis is also an example of GIGO.

          3) Ignoring the above, they still do commit the affirming-the-consequent error we have been discussing. The problem is that reported/registered live births do not necessarily correspond to actual live births. So the actual sex ratio of births at this hospital could be exactly the same as the baseline, but male births are reported more often for some reason.

          As described in the question, there is a preference for male births in India. Do the hospitals have any incentive to report a greater percentage of male births? Perhaps more patients will choose to go there out of superstition because they hear male births are more common at that hospital, so the high sex ratio amounts to advertising. Or perhaps there is a larger number of male vs female still-births, and still-births reflect badly on the hospital. Then these still-births may get inaccurately recorded as live births. Maybe there are more unwanted female babies that get abandoned at the hospital which would lead to undesirable paperwork for the staff, so instead they just drop them off at the orphanage or find a foster home off the books. Etc, etc. The number of plausible reasons for the mere presence of a deviation from the null hypothesis is essentially limitless.

          4) With regard to: “In words, the researchers are confident something very significant is going on, but don’t know why. Autism and bee colony collapse disorder are two headline news examples which may have this type of research behind them.” The news coverage is based on the magnitude of the change, not mere statistical significance. Also, these are two areas where nothing has been figured out despite many years of research. Instead, wild speculation, fraud, and conspiracy theories have proliferated. So I don’t understand why/how you want to interpret them as success cases for NHST.

          [1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2541782/

        • Anon> ***The extent of the appropriate conclusion is limited to a statement about the statistical significance***
          >the outcome of a default-nil-null hypothesis significance test tells you nothing about the external reality or any theory of interest that may make useful predictions
          > I don’t understand why/how you want to interpret them as success cases for NHST.

          Andrew> I think there are applications for which the concept of false positive and false negative, sensitivity and specificity, ROC curves, etc., are relevant. There are actual classification problems. I just don’t think scientific theories are typically best modeled in that way.

          Maybe this is really the crux of our difference, Anon.? You seem to be very theory-oriented, and you seem to be an insightful and creative thinker. Your discussion of male births in India was very interesting. I can understand your disdain at the lack of explanatory power in NHST.

          My own statistical goals are very modest. I appreciate having NHST at my disposal, to tell me that something seems to be commonplace, on whatever metric. Alternatively, testing an appropriate statistical hypothesis may signal that something seems extremely unusual, relative to appropriate expectations. In that case, I understand that it’s up to me to do my own thinking from there, with “possibly why?” questions and orthogonal deductive reasoning.

          Per Andrew’s comment, is it not true that our modern lives are vastly enhanced by an endless number of tests for statistical significance? The CO detector went off. The blood test was positive. Factory QA signaled the batch was out of tolerance. Etc etc. This is not pseudo-science.

          You’ve definitely made your point, though, which I appreciate and acknowledge. Thank you again for the very helpful discussion :) I’ll close with the following quote:

          “Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): While it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis.”
          https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Criticism

        • >”Per Andrew’s comment, is it not true that our modern lives are vastly enhanced by an endless number of tests for statistical significance? The CO detector went off. The blood test was positive. Factory QA signaled the batch was out of tolerance. Etc etc. This is not pseudo-science.”

          Please supply a reference you trust describing how one/some/all of these supposed use-cases is actually implemented. I am certain that upon investigation it will turn out to be some method other than NHST (as I defined it here multiple times) or there is no actual evidence of success. Also, note the pitfall of using the “evidence” produced by the NHST procedure to prove the usefulness of NHST.

  24. Bottom line: Cuddy is not a scientist and what she does is not science. She has her PhD, but we credential folk way beyond their justifiable abilities.

    More pessimistically, if Cuddy is a psychological scientist and her “work” is taken as psychological science, then the domain is showing very significant fraying at its seams.

    The groundwork is (whether psychology realizes it or not) being prepared for a rebirth of the ascendancy of behaviorism. Cuddy and her (many) ilk are spending great amounts of time and effort fertilizing the soil with their dung.

  25. I am not sure where to bring this up, but I think it needs attention. Near the beginning of her TED talk, Cuddy states, “Nalini Ambady, a researcher at Tufts University, shows that when people watch 30-second soundless clips of real physician-patient interactions, their judgments of the physician’s niceness predict whether or not that physician will be sued.”

    I have not found that study. I wonder whether Cuddy may be conflating two separate studies by Ambady: one of surgeons’ tone of voice, and another of soundless video clips of college teachers. In any case, I am skeptical of her use of the word “predict.” It’s unlikely that the study found that judgments of physicians’ niceness (from soundless video clips) predict *future* lawsuit patterns. Rather, I suspect that the study in question, whichever it was, related the videos to the physicians’ *existing* history of lawsuits. (Why do I suspect this? Because the former study would be extremely difficult to pull off.)

    When you misuse the word “predict” in this manner, you fuel the expectation that science will lead to magical findings. Cuddy’s statement has been quoted all over the place.

    I blogged about this (briefly) here: https://dianasenechal.wordpress.com/2016/10/06/what-does-predict-mean-in-research/

    • Diana,

      In statistics, “predict” is sometimes used in a different way from the ordinary usage of “predict something that will occur in the future.” For example, in trying to develop a model that might be useful for prediction of future events, a data set may be divided into a “training” and a “hold-out” set. The training set is used to develop a model, then that model is tested on the hold-out. If the model gives good* predictions on the hold-out data, it has some credibility for prediction in future cases.

      * “good” is often a fuzzy or subjective or conditional-on-context term here– e.g., weather predictions may be using the best available techniques, but still be inaccurate fairly often.
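      [Note: a minimal sketch in Python of the training / hold-out idea described here, on simulated data; the straight-line model and all numbers are illustrative. “Prediction” in this sense means predicting held-out cases the model never saw during fitting, not predicting events that lie in the future.]

      import numpy as np

      rng = np.random.default_rng(2)

      # Fake "historical" data: an outcome linearly related to a predictor, plus noise.
      x = rng.uniform(0, 10, 200)
      y = 3.0 + 0.5 * x + rng.normal(0, 1, 200)

      # Split into a training set (to build the model) and a hold-out set (to check it).
      idx = rng.permutation(len(x))
      train, hold = idx[:150], idx[150:]

      # "Train": fit a straight line on the training data only.
      slope, intercept = np.polyfit(x[train], y[train], 1)

      # "Predict" the hold-out outcomes the model never saw, and score the predictions.
      pred = intercept + slope * x[hold]
      rmse = np.sqrt(np.mean((y[hold] - pred) ** 2))
      print(f"hold-out RMSE: {rmse:.2f}")   # near the noise SD of 1 if the model generalizes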

      • Martha,

        Thank you very much for this explanation. This does not seem to apply, though, to Cuddy’s use of the word. I see little likelihood here of anything like a “training” and a “hold-out” set.

        Moreover, the training and hold-out sets would have to work in the same temporal direction, wouldn’t they? If you’re trying to predict a future event B on the basis of A, wouldn’t you start with A and then analyze its possible relation to a subsequent B? You couldn’t, say, establish a relation between doctors’ facial expressions and *past* lawsuits and then use that very model to predict *future* lawsuits. There’s too much ambiguity of cause and effect.

        I could be wrong about this–but it seems to me that Cuddy is conflating two studies, misinterpreting a study, or both. (To find out, I would have to track down the study in question; so far, it has eluded me.)

        If I am right in thinking that she conflated two studies, then I go to the “Surgeons’ Tone of Voice” study and look at Ambady’s wording: “Controlling for content, ratings of higher dominance and lower concern/anxiety in their voice tones significantly identified surgeons with previous claims compared with those who had no claims.” (The association is between tone of voice and *previous* claims.)

        Later on in the paper, she writes, “Logistic regressions were performed to examine the contribution of voice tone, beyond the content of speech, to predicting malpractice claims history.” The use of the word “history” here still suggests that she is referring to past claims, not future claims. Any prediction here is retrospective.

        In the “Half a Minute” study, involving soundless clips of teachers, Ambady writes, “consensual judgments of college teachers’ molar nonverbal behavior based on very brief (under 30s) silent video clips significantly predicted global end-of-semester student evaluations of teachers.” Here the word “predict” is used to refer to the future.

        So, in the one study (involving physicians), the word “predict” refers to a relation between the data and past claims; in the other (involving teachers), it refers to a relation between the data and future evaluations.

        I find it unlikely that there is a study that “predicts” future malpractice suits, even in the way that you describe. It just seems too difficult to pull off, both legally and logistically. How many doctors would agree to this in the first place? “We are going to videotape you and have the videos judged, and then we’ll follow you over the next decade to see whether you get sued. Mind you, you’re just our training set.” I imagine the first lawsuit might come from one of the doctors.

        Again, I could be flat-out wrong–but I suspect there is no Ambady study that predicts, on the basis of 30-second video clips, whether physicians will be sued in the future.

        • Diana,

          I gave the example of training and holdout data sets as just one example of how “prediction” is often used in statistics when there is no check with a future data set. The use is understandably misleading to those who are not familiar with it. When teaching statistics, I have tried to be careful to point out the difference between the technical and everyday use of “prediction,” as well as of other words that have both technical and everyday meanings. I’m afraid some people don’t give much thought to the possibility of confusing the two meanings, and consequently the two meanings easily become confounded in the learners’ minds.

          Some background comments:

          Prediction to the future is very difficult. (As Yogi Berra famously said: “It’s tough to make predictions, especially about the future.”) So we need to be aware of “degrees of credibility” of methods of trying to do so. Ideally, we would check a proposed model with future data, but that means waiting till the future. In many cases, we don’t have that luxury. (E.g., we can’t use tomorrow’s weather to give a prediction today for the day after tomorrow.) People have developed lots of methods (holdout/training sets is just one). But there are also trade-offs in using one method rather than another. For example, building a model with a holdout set seems at first blush to be better than building a model not using a hold-out set to check it on. But if there is not much data, using a holdout set leaves too little data in the training set to build a good model. Statisticians have developed lots of methods of trying to build and check a model that might be the best one can do with the data at hand, but they all have their strengths and weaknesses.

        • I should also add that things change — so that a model that might in fact be a good one for new data collected “now” might not be a good model for prediction to a month from now (think, e.g., of hurricanes, the economy, cultural changes, …)

          Also, I am not defending Cuddy — just trying to give some background on the general problem of prediction, as well as the standard (but confusing) use of the word.

        • Thank you, Martha, for all of this (including the Yogi Berra quote). I look forward to learning more about methods of building and checking a model and ascertaining degrees of credibility.
