
The time-reversal heuristic—a new way to think about a published finding that is followed up by a large, preregistered replication (in context of Amy Cuddy’s claims about power pose)


[Note to busy readers: If you’re sick of power pose, there’s still something of general interest in this post; scroll down to the section on the time-reversal heuristic. I really like that idea.]

Someone pointed me to this discussion on Facebook in which Amy Cuddy expresses displeasure with my recent criticism (with Kaiser Fung) of her claims regarding the “power pose” research of Cuddy, Carney, and Yap (see also this post from yesterday). Here’s Cuddy:

This is sickening and, ironically, such an extreme overreach. First, we *published* a response, in Psych Science, to the Ranehill et al conceptual (not direct) replication, which varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable (and note that even so, somehow people missed that they STILL replicated the effects on feelings of power). So yes, I did respond to the peer-reviewed paper. The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture. I’ve been quiet and polite long enough.

There’s a difference between having your ideas challenged in constructive way, which is how it used in to be in academia, and attacked in a destructive way. My “popularity” is not relevant. I’m tired of being bullied, and yes, that’s what it is. If you could see what goes on behind the scenes, you’d be sickened.

I will respond here but first let me get a couple things out of the way:

1. Just about nobody likes to be criticized. As Kaiser and I noted in our article, Cuddy’s been getting lots of positive press but she’s had some serious criticisms too, and not just from us. Most notably, Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto Weber published a paper last year in which they tried and failed to replicate the results of Cuddy, Carney, and Yap, concluding “we found no significant effect of power posing on hormonal levels or in any of the three behavioral tasks.” Shortly after, the respected psychology researchers Joe Simmons and Uri Simonsohn published on their blog an evaluation and literature review, writing that “either power-posing overall has no effect, or the effect is too small for the existing samples to have meaningfully studied it” and concluding:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.

OK, so I get this. You work hard on your research, you find something statistically significant, you get it published in a top journal, you want to draw a line under it and move on. For outsiders to go and question your claim . . . that would be like someone arguing a call in last year’s Super Bowl. The game’s over, man! Time to move on.

So I see how Cuddy can find this criticism frustrating, especially given her success with the Ted talk, the CBS story, the book publication, and so forth.

2. Cuddy writes, “If you could see what goes on behind the scenes, you’d be sickened.” That might be so. I have no idea what goes on behind the scenes.

OK, now on to the discussion

The short story to me is that Cuddy, Carney, and Yap found statistical significance in a small sample, non-preregistered study with a flexible hypothesis (that is, a scientific hypothesis that posture could affect performance, which can map on to many many different data patterns). We already know to watch out for such claims, and in this case a large follow-up study by an outside team did not find a positive effect. Meanwhile, Simmons and Simonsohn analyzed some of the published literature on power pose and found it to be consistent with no effect.

At this point, a natural conclusion is that the existing study by Cuddy et al. was too noisy to reveal much of anything about whatever effects there might be of posture on performance.

This is not the only conclusion one might draw, though. Cuddy draws a different conclusion, which is that her study did find a real effect and that the replication by Ranehill et al. was done under different, less favorable conditions, for which the effect disappeared.

This could be. As Kaiser and I wrote, “This is not to say that the power pose effect can’t be real. It could be real and it could go in either direction.” We question on statistical grounds the strength of the evidence offered by Cuddy et al. And there is also the question of whether a lab result in this area, if it were real, would generalize to the real world.

What frustrates me is that Cuddy in all her responses doesn’t seem to even consider the possibility that the statistically significant pattern they found might mean nothing at all, that it might be an artifact of a noisy sample. It’s happened before: remember Daryl Bem? Remember Satoshi Kanazawa? Remember the ovulation-and-voting researchers? The embodied cognition experiment? The 50 shades of gray? It happens all the time! How can Cuddy be so sure it hasn’t happened to her? I’d say this even before the unsuccessful replication from Ranehill et al.
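
To see concretely why a lone statistically significant result from a noisy study carries so little information, here is a minimal simulation sketch in Python (the effect size and sample sizes are made-up numbers for illustration, not those of the actual studies): when the true effect is small relative to the noise, the estimates that happen to cross p < .05 are, on average, gross exaggerations of it.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.1          # assumed small true effect, in standard-deviation units
    n_sims = 10_000            # number of simulated studies per scenario

    def simulate(n_per_group):
        estimates, pvals = [], []
        for _ in range(n_sims):
            treatment = rng.normal(true_effect, 1, n_per_group)
            control = rng.normal(0, 1, n_per_group)
            estimates.append(treatment.mean() - control.mean())
            pvals.append(stats.ttest_ind(treatment, control).pvalue)
        estimates, pvals = np.array(estimates), np.array(pvals)
        sig = pvals < 0.05
        print(f"n = {n_per_group} per group: {sig.mean():.0%} of studies reach p < .05; "
              f"average |estimate| among those is {np.abs(estimates[sig]).mean():.2f} "
              f"(true effect: {true_effect})")

    simulate(20)    # a small, noisy study
    simulate(100)   # a larger study

In this toy setup the small study rarely reaches significance, and when it does, the reported estimate is several times the true effect. That is the sense in which a single statistically significant result from a small, noisy, non-preregistered study tells us very little.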

Response to some specific points

“Sickening,” huh? So, according to Cuddy, her publication is so strong it’s worth a book and promotion in NYT, NPR, CBS, TED, etc. But Ranehill et al.’s paper, that somehow has a lower status, I guess because it was published later? So it’s “sickening” for us to express doubt about Cuddy’s claim, but not “sickening” for her to question the relevance of the work by Ranehill et al.? And Simmons and Simonsohn’s blog, that’s no good because it’s a blog, not a peer reviewed publication. Where does this put Daryl Bem’s work on ESP or that “bible code” paper from a couple decades ago? Maybe we shouldn’t be criticizing them, either?

It’s not clear to me how Simmons, Simonsohn, and I are “bullying” Cuddy. Is it bullying to say that we aren’t convinced by her paper? Are Ranehill, Dreber, etc. “bullying” her too, by reporting a non-replication? Or is that not bullying because it’s in a peer-reviewed journal?

When a published researcher such as Cuddy equates “I don’t believe your claims” with “bullying,” that to me is a problem. And, yes, the popularity of Cuddy’s work is indeed relevant. There’s lots of shaky research that gets published every year and we don’t have time to look into all of it. But when something is so popular and is promoted so heavily, then, yes, it’s worth a look.

Also, Cuddy writes that “somehow people missed that they STILL replicated the effects on feelings of power.” But people did not miss this at all! Here’s Simmons and Simonsohn:

In the replication, power posing affected self-reported power (the manipulation check), but did not impact behavior or hormonal levels. The key point of the TED Talk, that power poses “can significantly change the outcomes of your life”, was not supported.

In any case, it’s amusing that someone who’s based an entire book on an experiment that was not successfully replicated is writing about “extreme overreach.” As I’ve written several times now, I’m open to the possibility that power pose works, but skepticism seems to me to be eminently reasonable, given the evidence currently available.

In the meantime, no, I don’t think that referring to a non-peer-reviewed blog is “the worst form of scientific overreach.” I plan to continue to read and refer to the blog of Simonsohn and his colleagues. I think they do careful work. I don’t agree with everything they write—but, then again, I don’t agree with everything that is published in Psychological Science, either. Simonsohn et al. explain their reasoning carefully and they give their sources.

I have no interest in getting into a fight with Amy Cuddy. She’s making a scientific claim and I don’t think the evidence is as strong as she’s claiming. I’m also interested in how certain media outlets take her claims on faith. That’s all. Nothing sickening, no extreme overreach, just a claim on my part that, once again, a researcher is being misled by the process in which statistical significance, followed by publication in a major journal, is taken as an assurance of truth.

The time-reversal heuristic

One helpful (I think) way to think about this episode is to turn things around. Suppose the Ranehill et al. experiment, with its null finding, had come first. A large study finding no effect. And then Cuddy et al. had run a replication under slightly different conditions with a much smaller sample size and found statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any power-pose effect is fragile.

From this point of view, what Cuddy et al.’s research has going for it is that (a) they found statistical significance, (b) their paper was published in a peer-reviewed journal, and (c) their paper came before, rather than after, the Ranehill et al. paper. I don’t find these pieces of evidence very persuasive. (a) Statistical significance doesn’t mean much in the absence of preregistration or something like it, (b) lots of mistakes get published in peer-reviewed journals, to the extent that the phrase “Psychological Science” has become a bit of a punch line, and (c) I don’t see why we should take Cuddy et al. as the starting point in our discussion, just because it was published first.

What next?

I don’t see any of this changing Cuddy’s mind. And I have no idea what Carney and Yap think of all this; they’re coauthors of the original paper but don’t seem to have come up much in the subsequent discussion. I certainly don’t think of Cuddy as any more of an authority on this topic than are Eva Ranehill, Anna Dreber, etc.

And I’m guessing it would take a lot to shake the certainty expressed on the matter by team TED. But maybe people will think twice when the next such study makes its way through the publicity mill?

And, for those of you who can’t get enough of power pose, I just learned that the journal Comprehensive Results in Social Psychology, “the preregistration-only journal for social psychology,” will be having a special issue devoted to replications of power pose! Publication is expected in fall 2016. So you can expect some more blogging on this topic in a few months.

The potential power of self-help

What about the customers of power pose, the people who might buy Amy Cuddy’s book, follow its advice, and change their life? Maybe Cuddy’s advice is just fine, in which case I hope it helps lots of people. It’s perfectly reasonable to give solid, useful advice without any direct empirical backing. I give advice all the time without there being any scientific study behind it. I recommend writing this way, and teaching that way, and making this and that sort of graph, typically basing my advice on nothing but a bunch of stories. I’m not the best one to judge whether Cuddy’s advice will be useful for its intended audience. But if it is, that’s great and I wish her book every success. The advice could be useful in any case. Even if power pose has null or even negative effects, the net effect of all the advice in the book, informed by Cuddy’s experiences teaching business students and so forth, could be positive.

As I wrote in a comment in yesterday’s thread, consider a slightly different claim: Before an interview you should act confident; you should fold in upon yourself and be coiled and powerful; you should be secure about yourself and be ready to spring into action. It would be easy to imagine an alternative world in which Cuddy et al. found an opposite effect and wrote all about the Power Pose, except that the Power Pose would be described not as an expansive posture but as coiled strength. We’d be hearing about how our best role model is not cartoon Wonder Woman but rather the Lean In of the modern corporate world. Etc. And, the funny thing is, that might be good advice too! As they say in chess, it’s important to have a plan. It’s not good to have no plan. It’s better to have some plan, any plan, especially if you’re willing to adapt that plan in light of events. So it could well be that either of these power pose books—Cuddy’s actual book, or the alternative book, giving the exact opposite posture advice, which might have been written had the data in the Cuddy, Carney, and Yap paper come out different—could be useful to readers.

So I want to separate three issues: (1) the general scientific claim that some manipulation of posture will have some effects, (2) the specific claim that the particular poses recommended by Cuddy et al. will have the specific effects claimed in their paper, and (3) possible social benefits from Cuddy’s Ted talk and book. Claim (1) is uncontroversial, claim (2) is suspect (both from the failed replication and from consideration of statistical noise in the original study), and item (3) is a different issue entirely, which is why I wouldn’t want to argue with claims that the talk and the book have helped people.

P.P.S. You might also want to take a look at this post by Uri Simonsohn who goes into detail on a different example of a published and much-cited result from psychology that did not replicate. Long story short: forking paths mean that it’s possible to get statistical significance from noise, and they also mean that you can keep finding confirmation by doing new studies and postulating new interactions to explain whatever you find. When an independent replication fails, it doesn’t necessarily mean that the original study found something and the replication didn’t; it can mean that the original study was capitalizing on noise. Again, consider the time-reversal heuristic: Pretend that the unsuccessful replication came first, then ask what you would think if a new study happened to find a statistically significant interaction somewhere.
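
To put rough numbers on the forking-paths point, here is a simplified sketch (illustrative only; real forking paths involve analysis choices made after seeing the data rather than a fixed menu of tests, but the arithmetic is similar): with pure noise and several outcomes to choose from, something comes out statistically significant far more often than the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, n_per_group, n_outcomes = 10_000, 20, 5   # every outcome is pure noise
    hits = 0
    for _ in range(n_sims):
        treatment = rng.normal(size=(n_per_group, n_outcomes))
        control = rng.normal(size=(n_per_group, n_outcomes))
        pvals = stats.ttest_ind(treatment, control, axis=0).pvalue
        hits += (pvals < 0.05).any()     # report whichever comparison "worked"
    print(f"Chance of at least one p < .05 from pure noise: {hits / n_sims:.2f}")
    # roughly 1 - 0.95**5 = 0.23, not the nominal 0.05

The same arithmetic applies when the extra comparisons are subgroups, interactions, or moderators postulated after seeing the data.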

P.P.P.S. More here from Ranehill and Dreber. I don’t know if Cuddy would consider this as bullying. On the one hand, it’s a blog comment, so it’s not like it has been subject to the stringent peer review of Psych Science, PPNAS, etc, ha ha; on the other hand, Ranehill and Dreber do point to some published work:

Finally, we would also like to raise another important point that is often overlooked in discussions of the reliability of Carney et al.’s results, and also absent in the current debate. This issue is raised in Stanton’s earlier commentary to Carney et al. , published in the peer-reviewed journal Frontiers in Behavioral Neuroscience (available here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057631/). Apart from pointing out a few statistical issues with the original article, such as collapsing the hormonal analysis over gender, or not providing information on the use of contraceptives, Stanton (footnote 2) points out an inconsistency between the mean change in cortisol reported by Carney et al. in the text, and those displayed in Figure 3, depicting the study’s main hormonal results. Put succinctly, the reported hormone numbers in Carney, et al., “don’t add up.” Thus, it seems that not even the original article presents consistent evidence of the hormonal changes associated with power poses. To our knowledge, Carney, et al., have never provided an explanation for these inconsistencies in the published results.

From the standpoint of studying hormones and behavior, this is all interesting and potentially important. Or we can just think of this generically as some more forks in the path.

149 Comments

  1. Greg Francis says:

    I think Cuddy’s “sickening” reaction to these kinds of criticism is the result of confusing Carney, Cuddy & Yap (2010), which refers to a set of empirical findings and theoretical claims, with Dana Carney, Amy Cuddy and Andy Yap, who (as far as I know) are nice people trying to do good science. The confusion sometimes happens for the critics, the criticized, and (especially) for readers of the critiques.

    Cuddy seems to take the criticisms of Carney et al. (2010) as a personal attack on her scientific integrity rather than an expression of skepticism about her findings and claims. Although in my experience critics are often careful to focus on findings and claims rather than individuals, readers of critiques seem to often come away with an opinion that the critic personally attacked the original authors. In this particular case, I think Cuddy is behaving like these readers. As far as I can tell, all the discussions (including Andrew’s) have been about the ideas and evidence rather than about the scientists. Like Andrew though, I do not know what is happening behind the scenes, where the criticisms (perhaps from readers of the critiques) may be much more personal.

    We would all do well to remember to focus on the scientific merit of the empirical findings and the theoretical claims rather than on whether Profs. X, Y, and Z are bad scientists (which is an inference that is difficult to make from a single set of studies). If they really are poor scientists, then a focus on their evidence and claims will reveal it without personal accusations. I like that in these kinds of posts, Andrew often includes a statement reflecting something like the above ideas (“I have no interest in getting into a fight with Amy Cuddy….”)

  2. Kaiser says:

    I like the time reversal heuristic as a way to think about statistical replication. I have found that the whole forking-paths concept is not “obvious” and it is often hard to convince people who are trained in the do-one-experiment-and-find-significance school of thought. With the rise of A/B testing (better called large-scale online testing), people with little statistical training are running tests: I find that there is strong resistance to what we think are basic concepts like not running tests till significance, multiple testing, etc.

    What concerns me the most here are: (a) Cuddy’s implication that the replication of one of a set of metrics can be considered a successful replication, especially when the other metrics are not even close to replicating; (b) the idea that the replication needs to have precisely the same conditions as the original experiment – this implies that the finding is highly sensitive to original conditions, which does not give me confidence in the result.

    I wrote a little more about this here.
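
    For reference, a minimal sketch of the run-tests-till-significance problem, with made-up numbers: an A/B test with no true difference between the arms, where the analyst peeks at the p-value after every batch of users and stops as soon as p < .05.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        n_sims, batch, max_peeks = 2_000, 100, 20
        false_positives = 0
        for _ in range(n_sims):
            a, b = np.empty(0), np.empty(0)
            for _ in range(max_peeks):
                a = np.append(a, rng.normal(size=batch))   # arm A, no true effect
                b = np.append(b, rng.normal(size=batch))   # arm B, no true effect
                if stats.ttest_ind(a, b).pvalue < 0.05:    # peek; stop at the first "significant" result
                    false_positives += 1
                    break
        print(f"False-positive rate with peeking: {false_positives / n_sims:.2f}")
        # far above the nominal 0.05, and it keeps climbing with more peeks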

    • Keith O'Rourke says:

      I would agree that a measure of replication _should_ be time invariant.

      One suggestion would be to measure how consistently the individual studies move a common prior in the same direction, updating on each study separately – that would be the direction of Posterior.i/Prior as a _vector_. With k multiple findings those _vectors_ would be of dimension k, and having most of the k components move in the same direction would be needed for a high measure of replication.

      The decision of when enough studies have been done to take the direction as being adequately determined (i.e. worth publishing a claim about) could be assessed separately.

      p.s. Kaiser: your shopping example seems good to me – I’ll steal it.

  3. Eric Loken says:

    So we as social scientists have accumulated capital from our research successes. There is a bigger process unfolding (replication failures, misunderstood p-values etc.) eroding the valuation of some of that accumulated wealth. That there is so much over leveraged intellectual property in social science makes everyone uneasy. And it is understandably very upsetting for an individual when they suffer a devaluation. It just feels personal, as if you alone are taking the hit for what should be a broader market correction. And it’s also understandably frustrating that it’s unclear how to fight back against that devaluation.

    I guess I’ve often felt bad for new researchers because they will have a more difficult hill to climb. But I guess they’ll also have the benefit, if they are careful, of building less leveraged wealth. Hopefully they won’t have to give back big gains as often.

  4. Hugh Sansom says:

    Came across this sweet and sour treat late, so I’m a little behind.

    Sometime in the 80s and 90s, people went on about power ties. I remember sitting around a computer (not the AI) ‘lab’ at the Sloan School of whatever it is they do there at MIT (before we all had our own, personal, just to ourselves computers) listening to business students debate the color of the _paper clips_ on their job applications. (Yes, really.) Anyone seen the opening minutes of “American Psycho” with Christian Bale? The moments when Bale begins to seethe as the Wall Street wannabe tycoons compare size … I mean business cards? Business cards. These are things that business baboons worry about. They’d like us to believe they are geniuses at driving the economy. But they’re really more like peacocks posing. Or baboons presenting. (Google zoologists or primatologists on “presenting”.)

    So power poses. Big fat hairy deal. To somebody.

    The interesting question to me: What is the actual return on posturing? It suited Einstein and Picasso very well. Consummate self-promoters. But they had a few other things going for them, too. (Think of “consummate” verb vs. “consummate” adjective, and we have a sense of where the ‘we’re really only doing what our primate cousins do’ thinkers are headed.)

    Which of the founders said of George Washington “He was bound to be the first president. He was always the tallest man in the room”?

  5. Njnnja says:

    Isn’t the quote “somehow people missed that they STILL replicated the effects on feelings of power” indicative of the entire problem? Was “feelings of power” the *one* thing that they claimed the power pose would improve? It sounds like the original author is hunting around the replication paper, looking around for p < .05, and saying "Hey, I've got something!" If that's what they are doing with someone else's data what should we think they did with their own?

    I would be concerned that even people in academia are mistaking something done in a university, with some kind of math, and published in a journal as being "real science," while ignoring the much more important hurdles of something done with an experimental set up that is capable of finding the evidence that one hopes to find, with a limited amount of flexibility for the experimenter to enhance noise into a "signal", and standing up to rigorous critiques.

  6. Phil says:

    Just a few things.

    (1) It does seem kind of ridiculous to have people hold any pose other than “lounging on the couch” for six minutes, as Cuddy says (presumably correctly) was done in one of the so-called “replications.” I have sometimes been puzzled by the fact that many attempted replications make substantial changes from the protocol they’re supposedly replicating. I recognize that even if you do try to perform a perfect replication, that’s not going to happen: there will be lots of things that weren’t published in the original paper that you’ll have to guess at; both the experimenters and the subjects will be different from the original set, perhaps in important ways; you’ll be doing the replication in a different building in a different season and subjects will be in different moods, and all sorts of things like that. But with something like “hold the pose for N minutes,” if the claim is that there is an effect when N=2, why on earth would you do N=6? (Unless you were _also_ doing N=2).

    (2) It’s funny=amusing that Cuddy puts such faith in peer review — “The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach” — considering she’s presumably as familiar with the peer review sausage factory as I am. At times I have gotten thoughtful, challenging reviews that have made me do a lot of improving work on a paper, by reviewers who were obviously diligent and thorough enough to serve as the gatekeepers that the peer review process intends them to be. And at times I have been that kind of reviewer myself. But frequently I have gotten reviews that were clearly provided by reviewers who hadn’t read the paper carefully (or, perhaps, hadn’t read it at all). I’ve done some pretty cursory reviews myself; more than once I’ve provided a review with the caveat “I did not check all of the equations.” And nearly no reviewer actually checks the results of a statistical analysis…indeed, you usually can’t, because the data aren’t provided. Putting it all together, I don’t make a huge distinction between “peer reviewed” literature, the “gray” literature of published (but not peer-reviewed) papers and reports, and other sources of information such as blogs, code repositories, wikis, etc. This is not to say I don’t make any distinction at all: given no other information, I’d certainly trust a paper from a peer-reviewed journal over a blog entry on some blog I’ve never heard of. But I usually do have other information: I’m familiar with the authors, or their institution; the blog isn’t just some random blog I’ve never heard of, but one that I follow or that has been recommended to me; and so on. In short (too late!) I find it funny that someone would object to anyone getting information from a “non-peer-reviewed blog” by someone they know and respect. What’s more, I’m going to go out on a limb and suggest that this objection wouldn’t have come up if the “non-peer-reviewed blog” had _supported_ the findings rather than contradicting them.

    (3) Greg Francis’s point about taking criticism personally is very apt. I’ll extend it to say that it’s also easy to make things personal in one’s response. If Cuddy has indeed been “quiet and polite” in her response up to now — I haven’t followed this so I have no idea — then at least the “polite” part of that is laudable (I don’t see why “quiet” is a good thing but I presume she means it metaphorically, that she hasn’t been raising the emotional intensity of the conversation.) All I can say to this is: Cuddy, if you’re reading this, I encourage you to _continue_ being quiet and polite. Saying people are “bullying” you because they express skepticism about the magnitude of an effect that you claim, and give reasons for their skepticism…that’s impolite.

    (4) Andrew, I have a suggestion for future posts on the theme of “garden of forking paths” papers and other questioning of whether claims of substantial effect are justified: use the authors’ names only when citing the research, and don’t make a point of it. Your Slate article, for example, seems to be more about Cuddy than about the claimed Power Pose effect. Cuddy says this, Cuddy says that, Cuddy gave a Ted talk, etc., etc. WTF, man? Instead of “Amy Cuddy’s famous finding is the latest example of scientific overreach”, how about “Recently hyped findings are the latest example of scientific overreach”, or similar? Instead of “Consider the case of Amy Cuddy”, how about “Consider the case of the Power Pose”? Why make it personal?

    • Bob says:

      Re: Points (3) and (4)

      Several years ago, I authored a piece that attacked an idea that had been put forth by several “experts in the field”. But, I wanted to avoid criticizing those individuals by name. I quoted several authors making such flawed assertions—in order to show that this idea had currency. But, I left the attribution in the footnotes and never put an author’s name in the text. It did not make sense to leave out the references—but I could at least make the reader work a little to discover the author.

      Similarly, I recently gave a conference presentation. My powerpoint displayed some deeply flawed assertions that I was rebutting. For most such assertions, I footnoted the source for each assertion on the slide where they appeared. But, my footnotes were in 8 point type and could not be read by a person with normal vision.

      So, if someone disputed the accuracy of my quotations, they could find the source and check them. But, the average member of the audience would not have been able to tie the assertions to a specific author.

      Now, given that “power posing” seems tied to one particular individual, it would be harder to employ this technique. Still, it might be worth trying.

      Bob

    • Andrew says:

      Phil:

      All good points (as I’d expect from a co-blogger).

      Regarding your point 1, I’ll ask around about the 6-minute thing and get back to you. Actually, I’m a pretty impatient person and I wouldn’t even want to sit for 2 minutes.

      Regarding your point 4, I agree in general, and as you perhaps recall, I talk about the himmicanes and hurricanes study, and the fat arms study, and the ovulation and voting study, without generally emphasizing their authors. Cuddy I’m not so sure of, because she, like Steven Levitt and Satoshi Kanazawa, is part of the story. In her Ted talk etc., she’s not just selling power pose, she’s selling Amy Cuddy, just as Freakonomics was not just selling Levitt’s research, it was selling Levitt’s persona. That’s fine, I have no problem with it, but then Cuddy is a big part of the power pose story.

      • Kaiser says:

        Phil has some good points. I don’t agree with Point 1 though. So the power pose works if it is held for 2 min but not when it is between 3 and 4 min but works if it is 5.2 min but not 6 min, etc. We can ask the researchers why they chose 6 minutes but like I said above, if the effect is sensitive to things like that, I wonder if we are observing something real.

        • Phil says:

          I wonder if we’re observing something real too, and I think if someone wants to try 2 minutes and 4 minutes and 6 minutes that’s fine. But to _just_ do 6 minutes, and not do 2 minutes or anything close to it…that’s not a replication.

          There was recently a story in the New Yorker about some oncologists who found an effective treatment for, um, it might have been non-Hodgkins lymphoma, I forget (but you can look it up) back in the 1960s. They reasoned that they needed to use several drugs in sequence to avoid selecting for resistant cells, and that they needed to cycle through them several times in succession at essentially the highest doses and shortest recovery periods the patients could tolerate. It might take years to get such a study approved today, but in those freer times they were able to just go ahead and try it, according to the article. So they did…and it worked! Complete cures in some patients, long-term remission in others…huge, huge effects. They published their findings and gave some talks. Some time later, one of them visited some famous hospital in New York and discovered they weren’t using the protocol, and what they were doing instead — the old standard treatment — was as ineffective as ever: their patients were dying. The oncologist asked “why aren’t you using our protocol?” and was told “we tried it and it didn’t work.” He couldn’t believe it so he asked for more information…and it turned out that the doctors who tried it didn’t like one of the four drugs, so they swapped it out for another one. And they didn’t want the patients to be so uncomfortable so they decreased the dosages and increased the recovery time between cycles.

          The Power Pose thing sounds like claptrap to me, but the way to test whether the claimed effect is real is to try the same thing Cuddy et al. tried, not to try something very very different. And from the standpoint of holding a pose, 6 minutes is very very different from 2 minutes. Perhaps more to the point, why NOT try 2 minutes? What is to be gained by doing a different time?

  7. Steen says:

    Your follow-up posts are the best!
    This one was timely for me—I clicked on a ‘Presence’ ad on npr.org by accident. Good thing I had been primed to be skeptical of the effect size!

  8. Stuart Buck says:

    As impolite as this might be, perhaps Andrew should also look at the “preliminary research” that Cuddy cites in her recent New York Times op-ed (nytimes.com/2015/12/13/opinion/sunday/your-iphone-is-ruining-your-posture-and-your-mood.html):

    “How else might iHunching influence our feelings and behaviors? My colleague Maarten W. Bos and I have done preliminary research on this. We randomly assigned participants to interact for five minutes with one of four electronic devices that varied in size: a smartphone, a tablet, a laptop and a desktop computer. We then looked at how long subjects would wait to ask the experimenter whether they could leave, after the study had clearly concluded. We found that the size of the device significantly affected whether subjects felt comfortable seeking out the experimenter, suggesting that the slouchy, collapsed position we take when using our phones actually makes us less assertive — less likely to stand up for ourselves when the situation calls for it. In fact, there appears to be a linear relationship between the size of your device and the extent to which it affects you: the smaller the device, the more you must contract your body to use it, and the more shrunken and inward your posture, the more submissive you are likely to become. Ironically, while many of us spend hours every day using small mobile devices to increase our productivity and efficiency, interacting with these objects, even for short periods of time, might do just the opposite, reducing our assertiveness and undermining our productivity.”

    Hmmm. The study (see dash.harvard.edu/bitstream/handle/1/10646419/13-097.pdf?sequence=1 ) involved giving 75 people a bunch of tasks/surveys on an iPod Touch, an iPad, a MacBook, or a desktop (4 treatment groups). After the tasks/surveys were done, the experimenter would leave the room and promise to be back in 5 minutes, but wouldn’t, in fact, return. The question was how long the 18-20 people in each treatment group would wait to leave the room and find the experimenter. The finding was that the smaller the device, the longer people would wait to leave. Cuddy and her co-author take this as evidence that with smaller devices, people are more hunched over, and are therefore less assertive.

    Hmm. This finding could merely be because people liked or were more familiar with the smaller devices, or because they found smaller devices more distracting, or something else. It seems quite a leap to assume that a more hunched posture from smaller devices (note: posture wasn’t directly measured, as far as I can tell) made people more submissive.

    • Shravan says:

      ” In fact, there appears to be a linear relationship between the size of your device and the extent to which it affects you: the smaller the device, the more you must contract your body to use it, and the more shrunken and inward your posture, the more submissive you are likely to become.”

      Actually I have myself noticed this linear relationship between the size of my device and the extent to which it affects me, but not in an RCT.

    • psyoskeptic says:

      It must be a real effect because it’s significant with less than 20 subjects per condition. Therefore, it’s very large.

      Further, one needn’t measure the actual hunching. That obviously must have happened. And please ignore our irrelevant gambling tasks.

    • Rob says:

      Another theory: the tendency to leave a room varies in inverse proportion to how easy it is for someone to steal a device with which you’ve been entrusted. If I take a phone with me I’m worried about someone thinking I’m trying to abscond with it; if I leave it in an empty room it might disappear before I get back. Both dangers are smaller with a laptop.

      Not saying that explains the statistics; I do suggest my model is at least as plausible as Cuddy’s.

  9. Martin says:

    There is always another caveat or confounder round the corner. If you really want to get rid of a significant result you will succeed.

    When you do empirical research you need to have a theory first. I scanned the replication study and it reads as a purely empirical exercise. I miss a discussion of why no effect was found while the theory of the original study says otherwise. You either inherit that original theory and own it or come up with your own theory. Expecting someone else is wrong is not a theory.

    Surely as a layman I think this power pose thing is fishy. But I am underwhelmed if this is the poster child for replication.

    • Andrew says:

      Martin:

      As Simmons and Simonsohn explain, no special theory is needed to explain a statistically significant comparison in the first experiment and no such pattern in the replication. These results are entirely consistent with zero effect and the garden of forking paths.

      This is not to say that the effect is zero, just that the results from an initial statistical significance in a non-preregistered study, followed by a null result in a replication, are consistent with a null effect. Hence no theory needed, except the quite reasonable theory that how you sit for a couple minutes will have minimal effects on hormone levels.

      • Anonymous says:

        Don’t you take issue with this statement by Simmons and Simonsohn: “While the simplest explanation is that all studied effects are zero”?

        Since when are exact-zero effects the “simplest explanation”? A model with an exactly zero effect is highly specific and not “simple” at all by any defensible definition of simple. The statement confusingly reports the correct conclusion of “consistent with a zero-effect model” for the sake of making an implicit normative claim (one should believe this model because it’s the simplest one).

        • Andrew says:

          Anon:

          I agree with Simmons and Simonsohn that the exact-zero story is the simplest. I agree with you that exactly zero effects are not plausible. That’s why I prefer to say something like, The effects appear to be too small and too variable to estimate in this way.

          • Anoneuoid says:

            How about the following for the usual “compare two groups” type of study.

            Scenario A: Narrow interval that includes zero
            Scenario B: Wide interval that includes zero

            Given the methodology, data, and statistical model used
            A) this evidence is inconsistent with theories predicting large positive or negative values and the factors involved are expected to be relatively unimportant to any decision making process.

            B) this evidence has little ability to distinguish between different theories and it remains uncertain what influence the factors involved should have on any decision-making process.

  10. Anon says:

    Andrew, you might be interested in this Edge interview with psychologist Richard Nisbett.

    http://edge.org/conversation/richard_nisbett-the-crusade-against-multiple-regression-analysis

    “If that failed to replicate, it wouldn’t be surprising and it wouldn’t shake your faith in the idea that minimal cues, which ought to have no part in determining your behavior, do have an impact.”

    • Erikson Kaszubowski says:

      What a strange interview.

      I find it quite surprising that actual scientists might read a paper that draws causal conclusions from observational data using only multiple regression and accept the conclusions without question. It is also surprising that someone like Nisbett talks about multiple regression as if the tool itself, or the observational data to which it is applied, were the problem. He must be aware, I'm sure, of techniques such as propensity score matching, which allow causal inferences from observational data, sometimes using multiple regression.

      But the strangest point in his interview is that he seems to imply that causal inference from experimental data is iron-clad and always informative. As if those frail results from social psychology, which replicate under this-and-that unknown circumstances and not under other, also unknown, circumstances, were somehow advancing our knowledge about cognitive phenomena and should not be questioned when they do not replicate. We have this body of hard-to-replicate effects that somehow points to how our behavior might be driven by unimportant cues, but this causal effect appears in some contexts and not others, and we can't make any sense of it! Or we can search for "moderators" and claim to have found something new when p < 0.05 in every new experiment, adding more noise to the heap.

      It's the worst kind of Lakatosian Defense for a theory that has so little in the bank, quoting Meehl. Social Psychologists have a hunch-theory that apparently unimportant cues might affect behavior. It's more a hunch than a theory, because we can't really deduce any substantive hypothesis from it. Then, an experiment finds that some kind of priming has a positive effect on some behavior. Great, it seems that our hunch is not completely wrong! Then, a replication does not find the same effect. But wait, if we condition on this other covariate, we find a similar effect. Or we don't, but it only means that contextual cues are so contextual that the effect varies a lot, and our "theory" is more strongly corroborated.

  11. James says:

    Wow. I’ve found it. The saddest place on the internet.

    • Andrew says:

      James:

      Been saving up that zinger for a long time, huh?

      • James says:

        If a disgruntled researcher trusts his spidey sense over empirical evidence and the scientific method but no-one is there to read it, does it even make a sound? I can keep going. . .

        • James says:

          Also, does it seem a bit creepy to anyone else that Andrew is blogging point by point about Cuddy’s post on her friend’s Facebook page? Because it does seem a little off-color to me.

          • Andrew says:

            James:

            I’m not actually on facebook. As I wrote in my post above, someone emailed me the link. It doesn’t seem creepy at all for someone to respond to criticism. I don’t think it’s creepy for Cuddy to have responded to the criticism of Simmons and Simonsohn, I don’t think it’s creepy for Cuddy to have responded on facebook to my criticism of her claims, and I don’t think it’s creepy for me to respond on a blog to Cuddy’s criticisms. Nor for that matter do I think it’s creepy that you’re commenting here.

            Criticism and give-and-take, that’s central to science. Indeed, I’m responding to you here because communication is important to me, so I like to clarify confusing points. When one commenter raises a question, it could be that many readers have a similar question, hence this effort.

            Finally, you can feel free to trust your “spidey sense” over the empirical evidence presented by Ranehill et al. and analyzed by Simmons and Simonsohn. I don’t have any spidey sense to bring to this particular problem so I’ll just have to go with the empirical evidence and the associated statistical arguments. (The statistics are necessary in this case to generalize from sample to population, as it is the population that is of interest.)

            • James says:

              But does it make a sound Andrew?

              • Rob says:

                I take it as a sign of Andrew’s over-generous nature, and tendency to err on the side of encouraging opposing viewpoints, that the above comment was allowed through moderation. This kind of hateful trolling has no place in a reasoned discussion.

                (This comment doesn’t really progress any useful discussion either, so I will take no offense if it is excised in moderation.)

          • Brad Stiritz says:

            Andrew’s post doesn’t seem creepy or off-color to me in the slightest. I would characterize it as dead-on and pitch-perfect.

          • Rahul says:

            Hey, if you don’t want others to read it, use email. Not FB posts.

            Besides, FB offers you a ton of ways to restrict who can see what you post.

            If Andrew can read a comment, it’s fair game to be allowed to respond.

    • dl says:

      That’s impressive. It can’t be easy typing while holding a power pose.

  12. Felipe says:

    I have enjoyed the slate piece and this discussion very much.

    I wouldn’t be surprised if there weren’t a power pose effect at all (or if it were too fragile to be interesting). However, bringing in considerations about publication incentives, I have worries about how to interpret the epistemic import of one (1) non-replication. After all, independent replicators of popular findings can be expected to be biased for the null. I.e., if their baseline belief about the original finding was not skepticism, they wouldn’t even be attempting a replication study in the first place. Am I missing something?

    Glad to read that there will be a multi-site preregistered set of studies on the issue.

    • James says:

      Assuming that Ranehill passed 9th grade science, she knows how to follow an existing method. Why then did they not follow the published method if not in the hope of performing a non-replication? The incentive system certainly rewards non-replication over replication so it surprises me that these methodological variations aren’t viewed with more suspicion. It seems to me that much of this blog-land analysis is more emotionally driven than logically driven. Perhaps that’s why none of this has passed the peer review standard yet.

      • Rahul says:

        Fair enough. I concede that an exact replication would be good. I hope somebody steps up and does it.

        • James says:

          It would indeed. Why do you think they didn’t follow the method in the first place? Indeed they are getting more attention for not replicating than they would if they had replicated. There’s been a lot of fussing and statistical trickery and nobody has even attempted to replicate.

          • Anoneuoid says:

            This failure to perform an actual direct replication happens all the time. It is bizarre from a scientific perspective. I’m not sure there are more rewards for a “non-replication”; rather, it gives everyone an out and a need for more funding. “Oh, it must have been factor x.”

            “It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.”
            http://cerco.ups-tlse.fr/pdf0609/Meehl_1967.pdf

    • Kaedor says:

      You are indeed missing something: Ranehill et al. were looking to “replicate and extend” the findings, i.e. exploring moderators of power pose effects, while replicating the main effects along the way.

      • Clyde Schechter says:

        “exploring moderators of power pose effects, while replicating the main effects along the way.”

        But no, you can’t explore duration as a moderator of the power pose effect when you use only one duration. Had they used 2 and 6 minutes, and estimated the difference in effect, that would be a moderation analysis. Using only 6 minutes is really just a different experiment altogether. No moderation exploration. No replication either.

      • James says:

        Do any of the esteemed professors on this blog see any problem with dismissing the results of one experiment with an experiment with a different manipulation? If not then this is an even sadder situation than I thought.

        • Rahul says:

          One experiment? Doesn’t the Simonsohn analysis cover dozens of papers?

          • James says:

            There are two points. A Ranehill experiment that does not follow the method and is therefore not a replication. And a blog post that is not peer reviewed and none of you actually understand. For now let’s focus on how an experiment with a different method is not able to say that the result of the first experiment is invalid.

  13. Brad Stiritz says:

    Thank you Andrew for a great post with many important teachings!

    >While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve)

    Could someone please explain how this “right-skewed p-curve” would be generated in principle, or please link to a basic reference? “P-curve” means a curve of p-values, right?

    It sounds like this curve of p-values should be retrospective (“all studied effects”), but then how are multiple p-values generated? Via some kind of bootstrap process? Is there any standard terminology for any of the specifics here?

    Thanks in advance for any assistance.
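
    (For reference: a p-curve is simply the distribution of the statistically significant p-values across a set of studies; no bootstrap is involved, each study contributes its one p-value. If every studied effect is zero, p-values are uniform, so the significant ones spread roughly evenly over (0, .05); if the effects are real, small p-values greatly outnumber those near .05, which is the right skew. A toy simulation with made-up numbers:)

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)

        def p_curve(true_effect, n_per_group=30, n_studies=50_000):
            """Share of significant p-values falling in each .01-wide bin."""
            treatment = rng.normal(true_effect, 1, (n_studies, n_per_group))
            control = rng.normal(0, 1, (n_studies, n_per_group))
            p = stats.ttest_ind(treatment, control, axis=1).pvalue
            sig = p[p < 0.05]
            counts, _ = np.histogram(sig, bins=[0, .01, .02, .03, .04, .05])
            return np.round(counts / counts.sum(), 2)

        print("zero effect:", p_curve(0.0))   # roughly flat, about 0.2 in each bin
        print("real effect:", p_curve(0.5))   # right-skewed: most mass in the lowest bin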

  14. Rahul says:

    Off topic, but there’s already a ton of warning flags here screaming “Bad Science Ahead”:

    (1) Professor of Business Administration. (2) Author has a BA + MA + PhD all in Social Psychology. (3) Work published in “Psychological Science” (4) No pre-registration done (5) All co-authors are from a Business School

    Admittedly, none of these factors individually are a clincher, but taken together my priors would be extremely skeptical that I’m going to encounter any good Science ahead.

    • Slutsky says:

      With all due respect: I do not think it is either fair or necessary to make such a broad statement. We do not have any evidence on whether science from individuals with these characteristics is any better or worse than research in other disciplines. To make inferences on the quality of research in entire (very large and diverse) domains from problems with individual cases does not seem appropriate to me. With this comment, I do not want to make any judgement regarding the study which is discussed here, I just think it is best to focus on the characteristics of the paper.

      Disclaimer: I am at a business school, and I know many colleagues that share the characteristics described above that do serious and excellent research.

      • Rahul says:

        We all have our different priors I guess. I only stated mine.

        e.g. The editors of Psych. Sci. may not think it fair for them to become notorious for publishing non-replicable work.

        In any case, I wasn’t picking on one factor, but a combination. And the combination I outlined above, in my opinion, is one that would make me very leery.

  15. Rahul says:

    Cuddy seems to use salivary testosterone tests. Is this well correlated with blood-sampled testosterone? From what I know, doctors are somewhat skeptical about salivary tests of testosterone when using them for treatments.

    Here’s one abstract:

    In a series of studies, we identify several specific issues that can limit the value of integrating salivary testosterone in biosocial research. Salivary testosterone measurements can be substantially influenced during the process of sample collection, are susceptible to interference effects caused by the leakage of blood (plasma) into saliva, and are sensitive to storage conditions when samples have been archived.

    http://www.ncbi.nlm.nih.gov/pubmed/15288702

    Is there a danger of noise swamping the signal? She doesn’t test on-site. The samples are frozen and dispatched to a remote lab after varying periods in the freezer (“up to two weeks”).

  16. Shravan says:

    The line of work that Amy Cuddy is promoting is a nice example of the power of hype. It shows that you really can get very far with an idea that is (a) simple (it takes no intellectual effort or technical background knowledge to understand it), (b) easy to apply in real life (no real effort is needed). Just raise your hands up high and you’re done.

    If someone were to give a Ted talk saying that you have to put in years and years of dedicated effort to become good enough to be successful in a job interview in some high skill area, you would not get 30 million hits.

    Regarding Andrew’s point, about Cuddy not even being willing to consider that her hypothesis might be unsupported by data even in principle, maybe Andrew needs to send her a copy of Superforecasting. There, the author lists the characteristics you need to have in order to be a good forecaster, which is basically what a scientist is supposed to be. Ability to question one’s own beliefs is up there high on the list.

    • Andrew says:

      Shravan:

      Regarding your second paragraph: Yeah, you’re not kidding. I just googled *ted 10000 hours* and found a bunch of links to a Ted talk by Josh Kaufman who, according to one of the links, says It doesn’t take 10,000 hours to learn a new skill. It takes 20:

      The speaker, Josh Kaufman, author of The Personal MBA, explains that according to his research, the infamous “10,000 hours to learn anything” is in fact, untrue.

      It takes 10,000 hours to become an “expert in an ultra competitive field” but to go from “knowing nothing to being pretty good”, actually takes 20 hours. The equivalent of 45 minutes a day for a month. . . .

      • Shravan says:

        I have seen that 20 hours talk. I think he demonstrated his ability to learn the ukulele in 20 hours in that talk. I guess it all depends on how you define “learn”.

      • Keith O'Rourke says:

        Though perhaps true for “to go from “knowing nothing to being pretty good at not being caught by most as still knowing almost nothing”, actually takes 20 hours.

        Maybe not even 20.

        Some of my past endeavors:

        Learn to operate and demonstrate the safe use of industrial sandblasting equipment just from the product documentation (almost got caught in my first demo by not previously experiencing the push back from the nozzle).

        Setup and test a compressed air station onsite at a gas plant in the Yukon (what was new was doing this on site without the help of the service manager – almost got caught never having flown before).

        My first statistical consult at U of T three weeks before starting Msc biostats program as other students and faculty were away on vacation (saved by the very concise Analysis of Cross-Classified Categorical Data by Stephen Fienberg).

      • Andrew says:

        Shravan, Keith:

        Indeed, I’m not saying Kaufman is wrong. I’d love to be able to play the ukulele just a little bit, so maybe I’ll give it a try. I was just supporting Shravan’s earlier point that simple-and-easy ideas are popular!

  17. Andrew says:

    To all the commenters focusing on the differences between the two experiments:

    1. The Cuddy et al. experiment differs from the Ranehill et al. experiment by being a shorter treatment whose purpose is hidden from the participants. It’s possible there’s a large effect under the Cuddy et al. treatment and a small or zero effect under the Ranehill et al. treatment, but to make that claim you’re really threading the needle, and at the very least this calls into question the general claims made for power pose in practice.

    2. Don’t forget the time-reversal heuristic, which is the topic of the above post. If you start with the null finding of Ranehill et al., then follow up with a smaller study with a weaker treatment that is not blind to the experimenter and, as usual, is subject to the garden of forking paths: this does not add up to strong evidence in favor of there being a large and consistent effect. My point is that in the usual discussion the power pose effect benefits from a sort of incumbency advantage: this becomes clearer once you take that away via the time-reversal heuristic.

    3. Finally, if you go back to my Slate article and my post here the other day: My point of writing all this was not to shoot down the claims of strong evidence for power pose—Simmons and Simonsohn did that just fine already, and indeed I reported on the Ranehill et al. experiment several months ago—but rather to point out the disconnect between the outsiders who see the Ted talk or the Harvard Business School press release or the airport book or the p less than .05 and think it’s all proven by science, and the insiders who have a much more skeptical take. The insiders aren’t always right, but in any case this difference in perspective is interesting to me. And I think that at some point even NPR and Ted will catch on, and realize that if they blindly promote this sort of study, they’ll be on the outside looking in.

    • Rahul says:

      If the Cuddy “power poses” advice really works, it’d be interesting to try to get people to do it without them knowing why they are doing it. Does she have any thoughts on this, I wonder?

      Apparently if the subjects know, the trick doesn’t work, eh?

      • Shravan says:

        I suspect it helps in the sense that it probably helps the candidate do a bit better and look and feel more confident. From a practical perspective, the question for me is, how much does it help relative to other factors? Competence, for example. Probably not much. If I am going to spend my time preparing for the interview of my lifetime, I would probably be better off spending my lifetime preparing for the interview, rather than admiring my underarms in the mirror for the recommended number of seconds.

  18. Bob says:

    “Don’t forget the time-reversal heuristic which is the topic of the above post.”

    The time-reversal heuristic is a clever way to think about such issues. I’ll try to keep it in mind.

    Of course, a Bayesian analysis gives us
    P(X | both experiments) ∝ Prior(X)*Likelihood(X|first experiment)*Likelihood(X|second experiment).

    Order of the experiments does not matter—the product of the likelihoods is the same. It seems to me that the time-reversal heuristic is a good way to remind ourselves that the first experiment does not deserve extra weight just because it was first.
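
    To make that concrete, here is a minimal sketch in R using a conjugate Beta-binomial update with toy counts (the specific numbers are made up; any counts would do):

    # Posterior after two experiments, updated in either order (toy counts)
    update_beta <- function(ab, successes, failures) {
      c(ab[1] + successes, ab[2] + failures)
    }
    prior <- c(1, 1)                                          # Beta(1, 1) prior
    post_12 <- update_beta(update_beta(prior, 19, 3), 12, 8)  # first, then second
    post_21 <- update_beta(update_beta(prior, 12, 8), 19, 3)  # second, then first
    identical(post_12, post_21)                               # TRUE: order is irrelevant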

    Bob.

  19. Anna Dreber says:

    Hi everyone,

    We (Ranehill et al.) started our study with the aim of replicating AND extending the results of Carney, Cuddy and Yap (we have three behavioral tasks, not one as in Carney et al., in our study). Therefore, we made a few changes to the experimental protocol that seemed, for our purposes, improvements to the initial study. We have, therefore, always acknowledged our study as a conceptual replication, not an exact replication.

    Our online supplemental material gives full information on all deviations between our design and those of the original study (downloadable here http://pss.sagepub.com/content/26/5/653/suppl/DC1). We also wrote more about the differences between our study and Carney et al. in the post on datacolada (http://datacolada.org/wp-content/uploads/2015/05/Data-Colada-reply-v3.pdf) where we discuss the likelihood of potential moderators suggested by Carney et al for explaining the difference in results.

    With respect to the time spent in the poses, we had participants hold the poses for 3 min each, instead of 1 min each as in Carney et al. We asked the participants whether they found the poses uncomfortable or not, and in the supplementary material of the paper we analyzed the data separately for those who reported the poses to be comfortable; this does not change our results. In particular, we write the following in the supplementary material:

    “A referee also pointed out that the prolonged posing time could cause participants to be uncomfortable, and this may counteract the effect of power posing. We therefore reanalyzed our data using responses to a post-experiment questionnaire completed by 159 participants. The questionnaire asked participants to rate the degree of comfort they experienced while holding the positions on a four-point scale from “not at all” (1) to “very” (4) comfortable. The responses tended toward the middle of the scale and did not differ by High- or Low-power condition (average responses were 2.38 for the participants in the Low-power condition and 2.35 for the participants in the High-power condition; mean difference = -0.025, CI(-0.272, 0.221); t(159) = -0.204, p = 0.839; Cohen’s d = -0.032). We reran our main analysis, excluding those participants who were “not at all” comfortable (1) and also excluding those who were “not at all” (1) or “somewhat” comfortable (2). Neither sample restriction changes the results in a substantive way (Excluding participants who reported a score of 1 gives Risk (Gain): Mean difference = -.033, CI (-.100, 0.034); t(136) = -0.973, p = 0.333; Cohen’s d = -0.166; Testosterone Change: Mean difference = -4.728, CI(-11.229, 1.773); t(134) = -1.438, p = .153; Cohen’s d = -0.247; Cortisol: Mean difference = -0.024, CI (-.088, 0.040); t(134) = -0.737, p = 0.463; Cohen’s d = -0.126. Excluding participants who reported a score of 1 or 2 gives Risk (Gain): Mean difference = -.105, CI (-0.332, 0.122); t(68) = -0.922, p = 0.360; Cohen’s d = -0.222; Testosterone Change: Mean difference = -5.503, CI(-16.536, 5.530); t(66) = -0.996, p = .323; Cohen’s d = -0.243; Cortisol: Mean difference = -0.045, CI (-0.144, 0.053); t(66) = -0.921, p = 0.360; Cohen’s d = -0.225). Thus, including only those participants who report having been “quite comfortable” (3), or “very comfortable” (4) does not change our results.”

    We believe that what is likely the most important departure in our study from Carney, et al., is that we employed an experimenter-blind design, in which the researcher conducting the behavioral (risk and competitiveness) tasks did not know whether the participant had been assigned to an earlier high- or low-power pose. Of all the differences between the two designs, we believe this one seems the most likely to play a key role in explaining the difference in results.

    Finally, we would also like to raise another important point that is often overlooked in discussions of the reliability of Carney et al.’s results, and also absent in the current debate. This issue is raised in Stanton’s earlier commentary on Carney et al., published in the peer-reviewed journal Frontiers in Behavioral Neuroscience (available here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057631/). Apart from pointing out a few statistical issues with the original article, such as collapsing the hormonal analysis over gender, or not providing information on the use of contraceptives, Stanton (footnote 2) points out an inconsistency between the mean changes in cortisol reported by Carney et al. in the text and those displayed in Figure 3, depicting the study’s main hormonal results. Put succinctly, the reported hormone numbers in Carney, et al., “don’t add up.” Thus, it seems that not even the original article presents consistent evidence of the hormonal changes associated with power poses. To our knowledge, Carney, et al., have never provided an explanation for these inconsistencies in the published results.

    Thanks everyone for a great discussion, and thanks Andrew for giving this so much attention.

    Eva Ranehill and Anna Dreber

    • Rahul says:

      Just curious: how much did it cost to run your replication? Would it be feasible for you to run another variant where you replicate Cuddy’s design *exactly*? E.g., Cuddy also seems to question the fact that in your replication the subjects knew what the goal of the study was.

      Not that I think Cuddy’s effect is real, but if that’s what it’d take for everyone to be satisfied, it might be worth it?

      Especially given how much impact her hypothesis has had, surely someone must be interested in funding an exact replication? And given the one replication you’ve already run, you seem best placed to run another!

      Just my 2 cents.

      • Andrew says:

        Rahul:

        As noted in my post, Comprehensive Results in Social Psychology, “the preregistration-only journal for social psychology,” will be having a special issue devoted to replications of power pose. So you will get your wish, or something like it.

        • Eva Ranehill says:

          Rahul:

          Depending on the content of the coming special issue we are considering running a direct replication.

          The cost for our study was about USD 15,000, but the real cost is time, since participants take part in the study one at a time.

          • Rahul says:

            A different question:

            Is there previous literature on how reliable salivary testosterone measurements are? I assume a blood serum measurement is the gold standard.

            One suggestion: If you run another replication, do you think it would be feasible to do a blood draw and serum testosterone assay too?

            I’m wondering if what they are chasing based on the salivary measurements is mostly noise?

  20. Johannes says:

    Fun Fact:
    Searching for the most viewed TED talks puts Cuddy second on the list. Adding the filter topics=science makes it disappear.

  21. mark says:

    I’ve always been struck by the low sample size of the original Carney et al. paper. Who sets out planning a study on a phenomenon that they must have expected to have a weak effect and decides that an N of 42 would be appropriate? If one of my graduate students came to me with that research plan we would have a very long and uncomfortable chat.

    It is also striking how very close to the .05 threshold some of the implied p-values are. For example, for the task where the participants got the opportunity to gamble, the reported chi-square is 3.86, which has an associated p-value of .04945.

    Of course, this reported chi-square value does not seem to match the data, because it appears from what is written on page 4 of the Carney et al. paper that 22 participants were in the high power-pose condition (19 took the gamble, 3 did not) while 20 were in the low power-pose condition (12 took the gamble, 8 did not). The chi-square associated with a 2 x 2 contingency table with these data is 3.7667, not 3.86 as reported in the paper. The associated p-value is .052 – not <.05.
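
    For anyone who wants to check this in R, here is the calculation, with the cell counts as I read them from page 4 (rows are high/low power pose, columns are took/did not take the gamble):

    tab <- matrix(c(19, 3,
                    12, 8), nrow = 2, byrow = TRUE)
    chisq.test(tab, correct = FALSE)          # X-squared = 3.7667, df = 1, p = 0.052
    pchisq(3.86, df = 1, lower.tail = FALSE)  # ~0.0495, the p implied by the reported 3.86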

    Am I missing something??

    • mark says:

      Just realized (duh!) that this would undermine one of their central claims. I probably flubbed a simple chi-square calculation so will probably eat crow in a few minutes.

      • jrc says:

        How would the difference between p=.04945 and p=.052 change anything about our inferences?

        a) p=.05 is not magic, and being one side or the other should not change any (sane) person’s conclusions.

        b) the world is still a world where small manipulations of individuals produce small and varying (in sign and magnitude) effects across different people and it isn’t crazy to think that power poses will affect people in various ways

        c) it would still be a poorly designed, under-powered, noise-chasing type of “experiment” that doesn’t teach us much of interest about human beings or the technology of self-improvement

        • mark says:

          I agree with you completely that the difference is meaningless, but I doubt that the editors and many of the reviewers of the paper would see it the same way. The belief (accurate or not) that you have to get your p-value below this stupid threshold in order to have something meaningful is one the authors appear to hold. A misreporting of this kind also suggests the possibility of some fudging of the results, which should be distasteful to anyone.

          • Nick says:

            I get the same chi-square numbers as you. Not good.

            Also, check out this ANOVA:
            >>Finally, high-power posers reported feeling significantly more “powerful” and “in charge” (M = 2.57, SD = 0.81)
            >>than low-power posers did (M = 1.83, SD = 0.81), F(1, 41) = 9.53, p < .01; r = .44.
            If you make the most optimistic possible unrounded figures for this (Mhigh=2.574999, Mlow=1.825, both SDs=0.805), you get F=9.0933, not F=9.53. (You can also do a t-test and square the t value). The p value is still significant and <.01, but if this result can be confirmed, it's still sloppy reporting. (And it seems to me that the DFs should be (1,40) and not (1,41).)
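
            In case it is useful, here is that check in R. The group sizes of 22 and 20 are my assumption, carried over from the gamble-task counts discussed upthread; other splits of N = 42 move the F only slightly:

            m_high <- 2.574999; m_low <- 1.825; sd_pooled <- 0.805
            n_high <- 22; n_low <- 20
            t_stat <- (m_high - m_low) / (sd_pooled * sqrt(1 / n_high + 1 / n_low))
            t_stat^2                                          # about 9.09, not the reported F = 9.53
            pf(t_stat^2, df1 = 1, df2 = n_high + n_low - 2,
               lower.tail = FALSE)                            # still comfortably < .01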

            Oh, and while we're on the subject of sloppy reporting, have a look at 10.1111/j.1540-4560.2005.00405.x and calculate the t statistics and associated p values.

            • Andrew says:

              Nick:

              Good catch. I went to the relevant section of the paper, the Results section, and I see this:

              We created a composite score of warmth by averaging the three warmth items, α = .81. A one-way ANOVA revealed the predicted main effect on this score, F(2, 52) = 3.93, p < .03, such that participants rated the high-incompetence elderly person as warmer (M = 7.47, SD = .73) than the low-incompetence (M = 6.85, SD = 1.28) and control (M = 6.59, SD = .87) elderly targets. Paired comparisons supported these findings, that the high-incompetence elderly person was rated as warmer than both the low-incompetence and control elderly targets, t(35) = 5.03 and t(34) = 11.14, respectively, both ps <.01. In addition, reflecting the persistence of the stereotype of elderly people as incompetent, participants saw targets as equally (in)competent in all conditions, F(2, 52) = 1.32, n.s.

              I can’t quite follow what they’re doing here. It says N=55. It’s not clear to me whether it’s a between or within-person experiment, but given that they are presenting means and sd’s separately, I assume it’s between-person. But then don’t you need to know the N’s for each of the 3 conditions? If we assume N=18, 18, 19, then the correct t statistics are (7.47 – 6.85)/sqrt(.73^2/18 + 1.28^2/18) = 1.79 and (7.47 – 6.59)/sqrt(.73^2/18 + .87^2/19) = 3.34, respectively.
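
              Here is the same back-of-the-envelope calculation in R, using the assumed per-group N's of 18, 18, and 19 from above (the little helper is just the two-sample t from summary statistics, unequal variances):

              t_from_summary <- function(m1, s1, n1, m2, s2, n2) {
                (m1 - m2) / sqrt(s1^2 / n1 + s2^2 / n2)
              }
              t_from_summary(7.47, 0.73, 18, 6.85, 1.28, 18)  # ~1.79, not the reported 5.03
              t_from_summary(7.47, 0.73, 18, 6.59, 0.87, 19)  # ~3.34, not the reported 11.14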

              Maybe there’s something we’re missing in how the data were analyzed? I’d guess that someone just slipped up when typing numbers into a calculator, but who knows? A t statistic of 11.14 should’ve rung some warning bells, that’s for sure!

              The thing that really worries me about this particular paper is the garden of forking paths. From the study design:

              This design allows us to explore how different levels of stereotype-consistent behavior might impact not only ratings of competence, but also ratings on the other key dimension, warmth.

              As noted above, there were three experimental conditions and then at least four outcomes:

              Participants then indicated how much George possessed each of a series of personality traits, on a scale from 1 (not at all) to 9 (very). We included one manipulation check (competence) and three items designed to assess warmth (warm, friendly, and good-natured).

              They then reported results on the average of the three warmth items and separately on the competence items.

              Lots of multiple comparisons problems here. Just considering what’s in the paper already: they report two F tests, one of which is significant. They report two t tests, one of which is significant. They come up with an elaborate explanation (“an oppositional stereotype enhancement effect”) based on a comparison that does not happen to be statistically significant (assuming that we calculated the t statistics correctly here) and then they take non-significant comparisons (with this tiny N=55 study) as indicating no effect.

              There are also various degrees of freedom that were sitting there, not reported in the paper and perhaps not even examined by the researchers, but which would’ve been natural to examine if nothing statistically significant had been found in the above comparison. There were 3 different warmth items, also whatever information they had on the participants in the experiment (perhaps sex or field of study). Even if all that was available were the 4 survey items, that’s still enough for lots of forking paths.

              Also this: “As always, additional studies are in order and should manipulate warmth and competence of other groups with evaluatively-mixed stereotypes, including those from the competent, cold cluster.” This is the unfortunately common step of extrapolating some general truth from whatever happened to turn up statistically significant (or, in this case, not statistically significant, given the calculation error in the t statistic) in some particular sample, and then planning to move forward, never checking to see if these exciting counterintuitive findings could be nothing but noise.

              And, get this: according to Google Scholar, that paper has been cited 470 times! Imre Lakatos is spinning in his grave. Paul Meehl too. They’re both spinning. I’m spinning too, and I’m still alive.

              How did you happen to notice this?

  22. BG says:

    It turns out that someone did actually do an exact replication of the Cuddy study. The results didn’t replicate: http://psychfiledrawer.org/replication.php?attempt=MjI2

    • Daniel says:

      That wasn’t an exact replication. They mention in the link you posted:

      The poses and procedure for collecting the saliva samples were identical to the original study. However, a facial emotion task was included in this study. Although the task did not change the amount of time between the pre- and post-saliva tests, it did increase the amount of time a pose was maintained from 2-3 minutes to 10-12 minutes.

  23. Shravan says:

    In the original TED talk that got so many hits, Cuddy removed the uncertainty estimates from the comparisons of means. The error bars were in the paper (Fig 3), but she went to the extra effort of removing the error bars in the talk. I guess she didn’t want to fluster her audience with uncertainty estimates.

    Also, when I had read the paper some time back, it was clear they had monkeyed around with subject removal and so on. I found it really odd that their key effect was reported as p<0.05 when the correct rounding to two digits would have been 0.05, not less than 0.05.

    They wrote:

    "As hypothesized, high-power poses caused an increase in testosterone compared with
    low-power poses, which caused a decrease in testosterone, F(1, 39) = 4.29, p < .05;…"

    R delivers a p-value of 0.045. If I round to two digits, it would be 0.05. It seems like a trivial detail, but it would have been so much less impressive to say p=0.05 (accompanied by a silent "phew!"). That's the difference between 30 million hits and nothing, folks.
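
    (The R call behind that number, for anyone who wants to check it:)

    pf(4.29, df1 = 1, df2 = 39, lower.tail = FALSE)  # ~0.045, which rounds to 0.05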

    I also find it really depressing that Cuddy describes herself as a “public figure” on her Facebook page (where I went to look for her offending post about Andrew being a bully). When scientists start to see themselves as celebrities first, it’s time to quit and move on. I sometimes feel that Entsophy’s (or is it one of the Anonymous guys now?) aggressive and out-of-control ravings against professors have not been entirely without merit.

    Related:
    One of my students recently commented (maybe he got it from somewhere) that the key test of whether you understand the p-value is whether you are still using it. If you are using it, it is almost certain that you don't understand it. I want to use that quote in a journal article some day.

    • Rahul says:

      I think the rest of you Professors must share some of the guilt. In such cases, what’s the Association of Psychologists doing? Or the other Profs. at the University etc.? Surely we on the blog aren’t the only Profs. spotting the hype & errors?

      When you see one Prof. being repeatedly misinterpreted / hyped / misleading etc., why don’t Profs. speak up in organized ways to correct the wrong? Admittedly, blogs are stepping in to perform some of this useful function.

      But your professional bodies, associations, departments & other official fora are totally voiceless & passive about protecting the integrity of the scientific message.

      • Shravan says:

        Agreed, Rahul.

        What can I do personally? I am not associated with psychology or funding bodies in any official capacity, and the only professional memberships I have are in statistics (there’s a reason for that), with the exception of the Math Psych society, which is not really psychology at all as we understand it on this blog. I can and do point out all the problems in reviews I do. And I also write papers in linguistics and psycholinguistics to try to correct what I consider to be wrong (and I may be wrong about that too!). So I do this on a much smaller scale than Andrew, and in a much less in-your-face way than the Slate piece (I can see why Cuddy might react so negatively; it feels so personal).

        I guess there are more powerful people out there who could actually move things away from the current state.

        But I do concede your point. Talk is cheap, and people (including me) just sit on the sidelines, occasionally look up from their RStudio session, criticize someone on a blog comment, and go right back to RStudio-ing.

        What would be an appropriate systemic solution though? Educating social psychologists about statistical theory and applications would be a start, I guess. Is there *any* social psychologist out there doing this right? I am assuming that Darwinian processes would render him/her jobless if they end up with a null result each time, so the field will only be populated by p-barely-below-threshold-by-hook-or-by-crook researchers? OTOH, maybe the people who attract Andrew’s attention are still in a minority and the field as a whole is doing OK (I mean relatively; look at medicine as a baseline, totally messed up). As usual, we need some data on this to assess if intervention is needed at all.

        • Keith O'Rourke says:

          > the field will only be populated by p-barely-below-threshold-by-hook-or-by-crook researchers?
          I recently reviewed a published paper by a group of pre-tenured faculty (about 3 to 4 years into their first appointments) and a tenured faculty member who had earlier, without them, published one of the best methodological papers in their field.

          In the application paper I reviewed, the authors claimed to have used the methodology developed by the tenured author but had actually deviated, using a method NOT recommended in that earlier paper because it was not conservative enough. Using that non-recommended method got them a few p-values just under .05, whereas the recommended method would have yielded none.

          I was a bit surprised the tenured faculty member went along with this (being on that application paper) but maybe they did not notice, or thought no one else would notice (the journal didn’t) or maybe they felt they had to take the hit to help these junior faculty keep their jobs. My guess is it was the latter.

    • Andrew says:

      Shravan:

      After reading your comment I took a look at the paper more carefully and Wow, yes, lots of forking paths:

      To control for sex differences in testosterone, we used participant’s sex as a covariate in all analyses.

      A reasonable decision, except that in other papers in this literature this is not always done. Separation by sex alone gives several forking paths: You can analyze the data as is, you can run a regression by sex and consider the coefficient for sex as a control variable, you can look at the coefficient of sex, you can interact the treatment effect with sex, you can look at effects separately for men and for women, you can look at one hormone for men and the other for women, etc.
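
      To make that concrete, here is a sketch of just the sex-related branches, written out as R model fits on fake data (all variable names and numbers below are made up for illustration, not taken from the paper):

      set.seed(1)
      d <- data.frame(
        treatment       = rep(c("high", "low"), each = 21),
        sex             = sample(c("F", "M"), 42, replace = TRUE),
        testosterone_t2 = rnorm(42)
      )
      lm(testosterone_t2 ~ treatment, data = d)                      # ignore sex entirely
      lm(testosterone_t2 ~ treatment + sex, data = d)                # sex as a control
      lm(testosterone_t2 ~ treatment * sex, data = d)                # treatment-by-sex interaction
      lm(testosterone_t2 ~ treatment, data = subset(d, sex == "F"))  # women only
      lm(testosterone_t2 ~ treatment, data = subset(d, sex == "M"))  # men only

      That is five defensible-looking analyses of the same question, before we even get to the choices about which hormone controls to include.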

      All hormone analyses examined changes in hormones observed at Time 2, controlling for Time 1. Analyses with cortisol controlled for testosterone, and vice versa.

      I’m not clear here on what they did. When analyzing cortisol at time 2, which of the other variables (cortisol at time 1, testosterone at time 1, testosterone at time 2) did they control for? Different options would be possible. As usual, the point here is not whether the authors did multiple analyses and picked the best, it’s that they had flexibility in choosing their analyses in light of the data. Indeed, in their paper they never claim otherwise, and various different ways of analyzing these sorts of data are present in the literature, including in other papers by the same authors.

      But it is not clear what was actually done. In the above passage it seems pretty clear that all analyses of hormones were controlling for other hormone measurements, but then the main results that follow don’t seem to have such controls; they’re just basic ANOVAs.

      Also this, in a footnote:

      Cortisol scores at both time points were sufficiently normally distributed, except for two outliers that were more than 3 standard deviations above the mean and were excluded; testosterone scores at both time points were sufficiently normally distributed, except for one outlier that was more than 3 standard deviations above the mean and was excluded.

      What??? Did they just say that they excluded data from 3 subjects?

      No wonder Simmons and Simonsohn wrote that these results were consistent with no effect. With a small sample you only have to do a little bit of selection and you can get impressive-looking results.

    • Anon says:

      “When scientists start to see themselves as celebrities first, it’s time to quit and move on. I sometimes feel that Entsophy’s (or is it one of the Anonymous guys now?) aggressive and out-of-control ravings against professors has not been entirely without merit.”

      It’s important to always blame “the system” and “incentives”. It’s important to never put any blame or responsibility on the actual professors themselves. Wouldn’t it be great if mere mortals (i.e., non-academics) could all use this as an excuse for our behaviour!

      I am just wondering if Cuddy et al. have done any replications, or follow-up research themselves. If I were to have given a TED talk about power-posing saying that people should tell this stuff to other people who “could benefit from it”, I would want to make sure that what I am telling people is correct. For me that would mean not doing a TED talk in the first place with so little “evidence”, but aside from that it would mean doing lots of replications and follow-up research on the topic. The only related follow-up research I could find was this one:

      Cuddy, Amy J.C., Caroline A. Wilmuth, Andy J. Yap, and Dana R. Carney.”Preparatory Power Posing Affects Nonverbal Presence and Job Interview Outcomes.” Journal of Applied Psychology 100, no. 4 (July 2015): 1286–1295.

      I also wonder what her file-drawer looks like…

      • Shravan says:

        This paper had a larger sample size (61). This time they also got p-values that were not right on the border (on the nice side of the threshold).

        And from the cited paper:

        “The difference between high-power and low-power posers’ self-reported feelings of power (high-power: M = 2.47, SD = 0.93; low-power: M = 2.04, SD = 0.93) was marginally significant, F(1, 60)=3.258, p = .076.”
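
        (As an aside, the reported F and p at least agree with each other; a quick check in R:)

        pf(3.258, df1 = 1, df2 = 60, lower.tail = FALSE)  # ~0.076, matching the quoted p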

        Actually, why *shouldn’t* one consider a p of 0.05 or 0.07 or even 0.10 or 0.20 significant? It would make life much easier for publication, at least.

        • Anoneuoid says:

          Alpha is the expected value of p for each field. If you can collect more and cleaner data, alpha gets small, as in particle physics, where it is 3e-7. If your data are really messy and small, alpha gets larger (small-scale ecology studies probably use something like alpha = 0.1). In most cases a moderate balance is apparently struck at about alpha = 0.05.

        • Nick says:

          “The difference between high-power and low-power posers’ self-reported feelings of power (high-power: M = 2.47, SD = 0.93; low-power: M = 2.04, SD = 0.93) was marginally significant, F(1, 60)=3.258, p = .076 (d = 0.46, ηp2 = .053) (see Table 2).”

          Followed by
          “This finding is consistent with past research showing that power posing has a weak impact on self-reported feelings of power despite its stronger effects on cognitive and behavioral outcomes”

          Yes, because a p value of .076 is exactly the same thing as an indication of a weak effect.

          • Andrew says:

            Nick:

            Good catch. I’ve seen this mistake before but have never thought about it much. Now I’m thinking it’s part of a whole mistaken worldview, what Tversky and Kahneman called the belief in the law of small numbers.

            Also interesting to think about how Carney, Cuddy, and Yap would react to all the information in this thread. My guess is that their inclination would be to fight back or to just ignore us and hope we go away. But if they really care about the effects of posture on success, and if they really think their research is relevant to such questions, they should want to know what they got wrong, no?

            This is the part that still puzzles me: I can only assume that these researchers do think this topic is important and that their research is relevant. I have no reason to think that they’re hustlers, cackling all the way to the bank. And I assume they also think their data support their claims; I have no reason to think that they’re Marc Hausers, playing shell games with monkey videos. But still I can’t picture them engaging with any of this criticism, especially given their essentially empty responses to Ranehill et al. and to Simmons and Simonsohn.

            So how do Carney, Cuddy, and Yap see this? It’s harder to say for Carney and Yap but they did coauthor the article and the response to Ranehill et al. so I assume they’re with Cuddy on this one, just keeping a lower profile, which reduces both the upside and the downside of being associated with (a) this research and (b) worse, the tenacious defense of this research.

            I’m guessing that they think that everything where the p-value in their data was less than .05 (and everything where the p-value was more than .05 but which they mistakenly computed as being under that threshold) represents a large and true effect.

            And that when they computed the p-value as greater than .05, they believe that, as the experimenters, they get to choose whether this represents a real and small effect, or that it represents no effect.

            I’m guessing they also believe that all their conclusions are (a) interesting, surprising, and novel, and (b) perfectly consistent with their underlying theory. And that the particular analyses they published were the only analyses they performed on the data, and that had the data been different, their data-selection and data-analyses rules would’ve been the same.

            And I’m guessing they think that all of us—you, me, Simmons, Simonsohn, Ranehill, Dreber, etc.—have an ax to grind and that nothing we say is worth listening to.

            And, indeed, if you accept the (statistically unjustified and implausible) view that everything they saw in their sample reflects general patterns in the population—then all our criticisms are indeed irrelevant. If they already know the truth, then any statistical or methodological criticisms are meaningless technicalities, and any failed replications just indicate that the replicators don’t understand what they’re doing.

            For Carney, Cuddy, and Yap to move forward, they have to accept that they might be wrong, they need to accept the principle that the sample is not a perfect reflection of the population.

            In this case I think we’d all be better off if the concept of statistical significance (or other misusable techniques such as Bayesian posterior probabilities) had never been invented. Cos then Carney, Cuddy, and Yap would have to face up directly to their assumption that the sample is like the population; they couldn’t hide behind statistical significance and use a naive view of it as an excuse to dodge all criticism.

            • Shravan says:

              Andrew, are Bayesian posterior probabilities misusable because one can make the same kind of binary decisions as with p-values? Or is it something else entirely?

              About your other points: just remember that they probably don’t even know they did anything wrong in their analytical methodology. That’s probably why Cuddy called you a bully.

              This is how I started out doing statistics too, when I had to analyze data for the first time in 1999 or 2000 or so. I remember the feeling of puzzlement when statisticians started criticizing me. They might be in the same boat.

              • Nick says:

                I have the strong suspicion that if p values were made illegal by an act of Congress tomorrow and journal articles had to report Bayes factors instead, within a very short time you would have some magic value (e.g., 10) that became the new p=.05. Maybe BFs are less easy for the lazy or clueless researcher to hack than p values, but I think we need to acknowledge that the sociology of the research and publishing process plays a larger role (and theoretical deficiencies a corresponding smaller one) in the problems associated with p values than we might like to admit. Journals, especially in psychology, are going to continue to want nice, neat, yes/no decision criteria, and the news media that consumes this stuff even more. If they can’t start an article with “Scientists have discovered that…”, they won’t be interested. (We got a good look at the realities of the knowledge production process this last few days with the Fifth Quarter Fresh thing.)

          • Alex says:

            They do list an effect size of 0.46, which is medium-y by Cohen’s guidelines. But that’s a matter of taste.
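
            (And the d is easy to recover from the means and SD quoted above:)

            (2.47 - 2.04) / 0.93  # ~0.46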

      • Andrew says:

        Anon:

        We should blame the individuals involved and also blame the aspects of the system that incentivize these behaviors.

        • Anonymous says:

          The vast majority of academics realize their entire career amounts to gaming the system so that taxpayer dollars shield them from ever having to get anything right. Every graduate office I’ve ever seen has “how to” guides written by senior academics on how to mold their career to achieve exactly that.

          For example, in Economics they’ll tell you to find any unused novel data set so you can write a gimmicky paper that lands in a top journal. This has to be done by a certain point or you’re toast in Econ. Whether the paper uncovers a truth or not is irrelevant.

          The ones I feel sorry for, though, are the ones who drank the Kool-Aid on Frequentist statistics and genuinely believed they were learning a difficult and esoteric skill which allowed them to do “real” science. Now, when it’s too late for them, they’re finding out that not only was classical statistics a con, but everyone except them is wise to the con.

          • Brad Stiritz says:

            >The ones I feel sorry for, though, are the ones who drank the Kool-Aid on Frequentist statistics and genuinely believed they were learning a difficult and esoteric skill which allowed them to do “real” science..
            > not only was classical statistics a con, but everyone except them is wise to the con

            Objection: as I understand it, the issue is that classical statistics runs into problems with low-N, low-power studies. Careful design and adequate sampling in fact do allow for sound Frequentist inference and “real science”.

            Please someone correct me if I’ve got this wrong?

            • Anonymous says:

              Insisting on high-powered tests improves Frequentist procedures because it gets them closer to the Bayesian answer. But it doesn’t go all the way to being Bayesian. Consequently, it’s trivial to produce tests (by Neyman-Pearson’s definition of a test) which have high power and low alpha, but which yield ridiculously bad and obviously wrong results.

            • Anonymous says:

              For some reason there’s this strong force pushing people to claim some version of “Frequentist isn’t wrong, it just hasn’t been applied with enough fanatical zeal”.

              This is so contrary to the past 100 years of statistical history it just blows my mind everyone keeps insisting (without evidence) it’s true.

              • Brad Stiritz says:

                >For some reason there’s this strong force pushing people to claim some version of “Frequentist isn’t wrong, it just hasn’t been applied with enough fanatical zeal”.

                What is this straw man you’re pursuing?! If I understand correctly, within the (bio)medical community, much of the historical, and current, *and let’s not forget, overall highly effective* conceptual infrastructure is based on Frequentist & Type I/II thinking (including “sensitivity” and “specificity”). So we’re talking NIH, FDA, MD/PhD researchers, and practicing + teaching MDs. This must comprise 10Ks of people within this country alone.

                If there were a significant “effect size” to your claim (“Frequentist is wrong vs Bayesian”), one that was unconditionally applicable across the board, I daresay we would see medical science “in crisis”, along the lines of what’s being exposed in the social sciences.

                -> Can you cite evidence that medical science is in statistical crisis, overall, and therefore dangerous?
                -> Per your claims, should we not be observing widespread scandal within oncology, for example? Pharma bankruptcies due to “wrong” Frequentist efficacies claims? Droves of cancer doctors being tried for malpractice and losing their licenses, due to Frequentist quackery?

                >This is so contrary to the past 100 years of statistical history it just blows my mind everyone keeps insisting (without evidence) it’s true.

                What do you mean *precisely* by this? Less heat, please, and more light?

              • Anoneuoid says:

                >”If I understand correctly, within the (bio)medical community, much of the historical, and current, *and let’s not forget, overall highly effective* conceptual infrastructure is based on Frequentist & Type I/II thinking (including “sensitivity” and “specificity”). So we’re talking NIH, FDA, MD/PhD researchers, and practicing + teaching MDs. This must comprise 10Ks of people within this country alone.”

                I am not the same person as “Anonymous”, and would attribute this problem to NHST* rather than frequentism per se. But yes, this is an accurate description of the problem. Taken to its logical conclusion, we need to redo everything from at least ~1980, when the majority of those not trained to do NHST had retired or died. In certain areas we need to go back even earlier (depending on when NHST became prominent). So many people have wasted their lives confusing each other due to that practice that it is just unbelievable.

                Fisher even predicted it:
                “We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort.”
                http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf

                *NHST: using the default strawman nil-null hypothesis (one not predicted by any theory) and then interpreting rejection of it to mean your research hypothesis is true.

              • Andrew says:

                Anoneuoid:

                Yeah, that’s my point. “NHST” is the bad thing. P-values are one way to implement NHST, and Bayes factors are another. The problem to me is with the NHST, the idea of false positives and false negatives and all that crap. Not with exactly how it’s done.

              • Anoneuoid says:

                Also wrt “I daresay we would see medical science “in crisis”, along the lines that’s being exposed in the social sciences.” There has been discussion on this recently (I’m sure I have missed some):

                Sept 2011: Bayer reports ~25% of published biomed research results are reproducible.
                http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html

                May 2012: Amgen reports ~10% cancer results are reproducible.
                http://www.nature.com/nature/journal/v483/n7391/full/483531a.html#ref4

                June 2015: A cancer reproducibility project was initiated, but after a while people started getting disenchanted and leaving because it is not even worth reproducing such flawed research.
                http://science.sciencemag.org/content/348/6242/1411

                June 2015: It is estimated that half of all US biomed research funds are wasted on reports that are not even reproducible in principle.
                http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165

                December 2015: Continuing info on the cancer reproducibility effort: A quarter of the initial studies needed to be dropped before replications because it was so difficult to get methodological information and materials.
                http://www.nature.com/news/cancer-reproducibility-project-scales-back-ambitions-1.18938

                Andrew wrote:
                >”Yeah, that’s my point. “NHST” is the bad thing. P-values are one way to implement NHST, and Bayes factors are another. The problem to me is with the NHST, the idea of false positives and false negatives and all that crap. Not with exactly how it’s done.”

                I agree.

          • Shravan says:

            “The vast majority of academics realize their entire career amounts to gaming the system so that taxpayer dollars shield them from ever having to get anything right.”

            Where are the data to back this claim up? I know a few people like that, who I am sure are just gaming the system as you describe, but a larger proportion of the people I know are not.

            “Every graduate office I’ve ever seen has “how to” guides written by senior academics on how to mold their career to achieve exactly that.”

            “in Economics they’ll tell you to find any unused novel data set so you can to write a gimmicky paper that lands in a top journal. This has to be done by a certain point or you’re toast in Econ. Whether the paper uncovers a truth or not is irrelevant.”

            I have never seen anything like this in my life. Can you provide a scan or a link to any such claim anywhere in any academic department?

            These kinds of claims can only be seen as bogus without any hard evidence to back them up.

        • Anonymous says:

          What academia desperately needs is a period of chaos. A genuine unstructured free for all.

          It’s a bit like annealing (or simulated annealing, to the machine learning people). By raising the temperature and making things more chaotic, there is a chance the system will cool back down to a better optimum.

  24. Anon says:

    ” First, we *published* a response, in Psych Science, to the Ranehill et al conceptual (not direct) replication, which varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable (and note that even so, somehow people missed that they STILL replicated the effects on feelings of power). “

    If i am not mistaken, Cuddy et al. do not seem to have a problem with holding the poses for 6 minutes in their other research:

    http://faculty.haas.berkeley.edu/dana_carney/pp_performance.pdf

    “Participants maintained the poses for a total of five to six minutes while preparing for the job interview speech”

    • Rahul says:

      Ha. Good catch. What say you, Cuddy? More bullying?

    • Stuart Buck says:

      Maybe I’m missing something, but why is Cuddy saying in the first place that the discrepancy was between a 2-minute pose and a 6-minute pose? One of the Ranehill et al. co-authors (upthread) characterized it this way: “With respect to the time spent in the poses we had participants hold the poses for 3 min each, instead of 1 min each as in Carney et al.”

      Has Cuddy multiplied the time by 2 for each study?

      (Note: the actual studies linked here describe the time as 1 minute and 3 minutes, respectively, not 2 minutes and 6 minutes: datacolada.org/2015/05/08/37-power-posing-reassessing-the-evidence-behind-the-most-popular-ted-talk/).

  25. Brad Stiritz says:

    >The problem to me is with the NHST, the idea of false positives and false negatives and all that crap

    Hi Andrew. Per my query to Anonymous above: If sensitivity and specificity are truly crap analyses, as you posit, are they then destined to be sued out of use, as statistical/medical malpractice? Surely there should already be an instructive, sensational court case?

    https://en.wikipedia.org/wiki/Sensitivity_and_specificity

    • Andrew says:

      Brad:

      1. I think there are applications for which the concept of false positive and false negative, sensitivity and specificity, ROC curves, etc., are relevant. There are actual classification problems. I just don’t think scientific theories are typically best modeled in that way. I don’t think it makes sense to talk about the incumbency advantage or power pose being true or false, for example.

      2. Even when “false positive and false negative” don’t make sense, the larger concepts represented by sensitivity and specificity can still be relevant. I tried to capture some of this with type M and type S errors but that framework is a bit crude too. I think it’s a good idea for us to continue working to express mathematically the ideas of scientific hypothesis building and testing without getting trapped in a naive acceptance/rejection framework. This theorizing was fine in the 1940s—statistical researchers had to start somewhere—but we can move on now.

    • Anoneuoid says:

      >”Surely there should already be an instructive, sensational court case?”

      SCOTUS ruled against NHST six years ago:
      “For the reasons just stated, the mere existence of reports of adverse events—which says nothing in and of itself about whether the drug is causing the adverse events—will not satisfy this standard. Something more is needed, but that something more is not limited to statistical significance and can come from “the source, content, and context of the reports,” supra, at 15. This contextual inquiry may reveal in some cases that reasonable investors would have viewed reports of adverse events as material even though the reports did not provide statistically significant evidence of a causal link.”
      http://www.supremecourt.gov/opinions/10pdf/09-1156.pdf

      There may be earlier or better examples. I suspect there is probably a long history of courts ruling NHST inadequate, but this is well outside my area of research expertise.

  26. Brad Stiritz says:

    Anoneuoid: thank you for taking time to research and provide links. I don’t feel the citations decisively support your very strong claims, but they’re very interesting in their own right.

    Regarding the Supreme Court case you mentioned, I didn’t get too far into it before finding the following:

    “Matrixx’s premise that statistical significance is the only reliable indication of causation is flawed. Both medical experts and the Food and Drug Administration rely on evidence other than statistically significant data to establish an inference of causation”

    I have so far not come across any Frequentist or NHST teachings that draw a link between significance and causation. In fact, the texts and teachers I work with have all stressed SCOTUS’ point precisely.

    Under high pressure, many people will choose to “Say something, anything!” Matrixx tried a junk-science approach and lost.

    • Anoneuoid says:

      >”I have so far not come across any Frequentist or NHST teachings that draw a link between significance and causation.”

      I’m pretty sure that if we actually check, we will find this is common. Can you give some examples that are available online?

      Everyone doing medical research that uses NHST does this. I went and checked NEJM and picked the first paper I saw. Amazingly, here it is in the very first sentence of the very first paper I checked:
      “In previous analyses of BENEFIT, a phase 3 study, belatacept-based immunosuppression, as compared with cyclosporine-based immunosuppression, was associated with similar patient and graft survival and significantly improved renal function in kidney-transplant recipients.”
      http://www.nejm.org/doi/full/10.1056/NEJMoa1506027

      No, you cannot accept your favorite hypothesis that “blatacept-based immunosuppression with improve renal function” just because there was a statistically significant difference between that group and the controls. Even looking later in the paper they describe other reasons for this difference (drug administration orally at home vs IV at the doctor).

      • Brad Stiritz says:

        If I understand you correctly, and without reading the paper: Yes I agree, it may have been highly inappropriate of them to have used the word “benefit” (which you capitalized for emphasis). Benefit does imply causation, and apparently only association was demonstrated.

        “Possible benefit” would have been an acceptable choice of words, don’t you think?

        https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

        • Anoneuoid says:

          BENEFIT is apparently an acronym for that project, I did not add emphasis. There were also at least two typos in my post: “blatacept-based immunosuppression with improve renal function”. That should be “belatacept-based immunosuppression will improve renal function”. I’m not sure if that contributed to the confusion.

          The first sentence of the paper I quoted, along with the lack of investigation of other explanations for the “improved renal function”, illustrates my point. The authors have taken a significant difference between the two treatment groups to mean “immunosuppression due to belatacept treatment has caused improved renal function.” This is despite the fact that they are clearly aware of other differences between the groups.

          Also, I bothered to look a bit further. If you follow the citation regarding renal function assessment[1], you can find that the difference between the two groups is equal to the average difference between male/female or white/black (plus who knows what other subgroups). In fact, the estimate of renal function is determined by someone’s race/sex (e.g., some serum measurement is multiplied by .8 if the patient is female). So there are likely many reasons for such a deviation from the null hypothesis that need to be investigated.

          My point is, I did not need to look far to find someone rejecting a null hypothesis and accepting the research hypothesis. This paper is in no way exceptional. Where are these people learning it?

          [1] http://www.ncbi.nlm.nih.gov/pubmed/10075613

          • Brad Stiritz says:

            Thank you for clarifying and for your further investigation. I’m starting to understand where you’re coming from. Definitely agreed that researchers need to take statistical competence seriously!

            >I did not need to look far to find someone rejecting a null hypothesis and accepting the research hypothesis. This paper is in no way exceptional. Where are these people learning it?

            To drop a few names: Taleb, Kahneman, and Pinker have all written extensively on the theme that proper statistical thinking is highly non-intuitive for most people. Andrew Gelman touches on this as well, here and there in his blog. The widespread availability of cookie-cutter tools and tests makes it very easy to “cut to the chase”. To be generous, perhaps people don’t realize that they don’t understand? To be realistic, it’s the rare person who doesn’t cut corners *somewhere*.

            We might imagine a better system, modeled on the construction industry, in which research papers would require sign-off by a licensed statistician. This seems likely to be accepted practice sometime in the future ;)

            • Anoneuoid says:

              >”We might imagine a better system, modeled on the construction industry, in which research papers would require sign-off by a licensed statistician.”

              The problem with the usual approach is that there is a many-to-one mapping of research hypotheses to the statistical “alternative hypothesis”, so rejection of the null hypothesis can only lead to affirming-the-consequent errors.

              This can be dealt with by instead having the statistical null hypothesis correspond to a precise prediction of the research hypothesis (it is obvious to anyone not trained in NHST that it should always have been this way). Then, when the research hypothesis/theory has survived a strong test (due to the precision and accuracy of the prediction), we tend to believe it has something going for it. So I don’t think your idea will work, since this is not actually a problem in the realm of statistics. See, e.g., Fig. 2 here:

              Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It. Paul E. Meehl. Psychological Inquiry. 1990, Vol. 1, No. 2, 108-141. http://www.psych.umn.edu/people/meehlp/WebNEW/PUBLICATIONS/147AppraisingAmending.pdf
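
              Here is a tiny R sketch of the contrast I have in mind, with made-up data and a made-up predicted effect of 0.40; the point is only that the null can be the theory’s own precise prediction rather than the nil:

              set.seed(2)
              y <- rnorm(30, mean = 0.42, sd = 1)  # fake data
              t.test(y, mu = 0)     # nil null: rejecting it only rules out "exactly zero"
              t.test(y, mu = 0.40)  # null = the theory's precise prediction; rejecting this
                                    # counts against the theory, while surviving it is a
                                    # (modest) corroboration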

              • Brad Stiritz says:

                Thank you for the interesting link.

                >..So I don’t think your idea will work since this is not actually a problem in the realm of statistics.

                Hmm, I thought we had agreed that there is actually a problem with non-statisticians misapplying statistics..? I predict this will continue to be an issue for some time.. ;)

              • Anoneuoid says:

                >”Hmm, I thought we had agreed that there is actually a problem with non-statisticians misapplying statistics..? I predict this will continue to be an issue for some time.. ;)”

                The typical end user of statistics has no conception of a distribution and uses SEM error bars rather than SD because they are narrower. They will have to take the time to educate themselves, there is nothing we can do besides ignore their analyses.
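
                (The SEM-vs-SD point in two lines of R, with made-up data: the standard error is the SD divided by sqrt(n), so it is always the narrower bar.)

                x <- rnorm(20)
                c(sd = sd(x), sem = sd(x) / sqrt(length(x)))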

                Still, I doubt you can find a recent statistics textbook that gives an example of some form of “significance/hypothesis test” and does not make the error I described above. I have seen Student make it and Neyman make it, but interestingly I have never seen Fisher fall prey to it.

  27. Richard D. Morey says:

    There’s further discussion of the Slate piece on the ISCON Facebook group, where Paul Coster accuses the people who are against hyping un-peer-reviewed work of hypocrisy for not critiquing the Gelman and Fung piece, because it discusses a blog post (apologies for my first post, in which I forgot that you wrote it and referred to the authors as “journalists”).

  28. Brad Stiritz says:

    Anoneuoid>I doubt you can find a recent statistics textbook that gives an example of some form of “significance/hypothesis test” and not make the error I described above.
    >..can be dealt with by instead having the statistical null hypothesis correspond to a precise prediction of the research hypothesis

    Bock et al., “Stats: Modeling the World,” 4e (2015) is an excellent AP-level HS text. From Chapter 19, “Testing Hypotheses About Proportions,” p. 497:

    “In statistical hypothesis testing, hypotheses are almost always about model parameters.. The null hypothesis specifies a particular parameter value to use in [the] model.. We write (Hsub0) : parameter = hypothesized value. (HsubA) contains the values of the parameter we consider plausible when we reject the null.”

    • Anoneuoid says:

      I was unable to get access to the book. But what I would look at is how the testing procedure is actually applied in the example problems and what conclusions are drawn or asked of the student. Can you give some examples of these?

    • Anoneuoid says:

      Not sure if these are official, but check out ppt 21 slide 5:
      -There is a temptation to state your claim as the null hypothesis.
      –However, you cannot prove a null hypothesis true.
      -So, it makes more sense to use what you want to show as the alternative.
      –This way, when you reject the null, you are left with what you want to show.
      https://mhsapstats.wikispaces.com/BVD+Powerpoints+and+Chapter+Notes

      Here we have an explicit encouragement to reject the null hypothesis and accept “your claim”, ie the student is taught to make the basic logical error of affirming the consequent.

      • Anoneuoid says:

        I realized something really disturbing here. This stats teacher has learned that high school students find it “tempting” to test their actual research hypothesis. This is consistent with my personal experience (although mine came after HS): I still remember being in their shoes and also finding it “tempting” to test my actual hypothesis rather than some other default hypothesis.

        As early as high school, students are thinking scientifically but are being specifically told to stop doing so during class. This must be happening all across western civilization at earlier and earlier ages. I really didn’t believe that my estimate of the degree of damage being caused by NHST could get any worse, but I had never considered that the age at which people start learning this stuff is also getting younger.

      • Brad Stiritz says:

        >But what I would look at is how the testing procedure is actually applied in the example problems and what conclusions are drawn or asked of the student. Can you give some examples of these?

        With respect, it sounds like you may be moving the goalposts here. I provided a specific example, which you doubted I could provide. All three of the text’s authors are award-winning educators. It seems a bit presumptuous to suggest (particularly from safe anonymity) that they don’t actually understand what they’re teaching and are in fact misteaching statistics.

        >Here we have an explicit encouragement to reject the null hypothesis and accept “your claim”…

        I’m not sure what your criticism is? The “claim” is a well-defined algebraic statement involving a parameter value, as you insisted earlier.

        >..ie the student is taught to make the basic logical error of affirming the consequent.

        There is a sidebar on the page I referenced above, right next to the paragraph I quoted. In large font is printed Fisher’s classic quote, ending with:

        “Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

        As far as I can tell, Bock et al’s logic directly follows Fisher’s. Now, are you claiming that this type of classic NHST inference is logically deficient? I’m having trouble finding authoritative discussion on the web in support of your thesis. Let’s please work out exactly what you mean. What do you think of the next paragraph, which I have formulated on my own, following the reference below? I believe it summarizes NHST inference in the form of “If P, then Q”:

        If a precise null hypothesis is found extremely unlikely to be true, then an appropriately defined alternative hypothesis may be accepted as significantly more likely to be true, subject to the statistical power of the particular significance test used.

        Per reference below, “Affirming the consequent” follows the form:

        If P, then Q.
        Q.
        Therefore, P.

        Would you please demonstrate via simple example how an NHST deduction, following the lines of my conditional above, might fall prey to the fallacy?

        Reference: https://en.wikipedia.org/wiki/Affirming_the_consequent

        • Anoneuoid says:

          >”With respect, it sounds like you may be moving the goalposts here.”

          Sorry for the miscommunication. I originally asked for ‘an example of some form of “significance/hypothesis test”’, I later again asked: “Can you give some examples of these[example problems and what conclusions are drawn or asked of the student]”? These are meant to ask for the same thing, ie an actual example of the testing procedure being applied. Since it is not (easily) possible for me to check this book myself, I asked for additional info so that my initial request could be met. There is no moving the goalposts.

          >”Bock et al’s logic directly follows Fisher’s…If a precise null hypothesis is found extremely unlikely to be true, then an appropriately defined alternative hypothesis may be accepted as significantly more likely to be true, subject to the statistical power of the particular significance test used.”

          1) The first issue with what you have written is only tangential to my point and I do not wish to focus further on it (since it has been a major distraction from the main problem), but the alternative hypothesis was not part of Fisher’s logic. You appear to be working with a hybrid of Fisher and Neyman-Pearson. You can search “NHST hybrid” for much discussion of this, but for an introduction to the phenomenon see: Mindless statistics. Gerd Gigerenzer. The Journal of Socio-Economics 33 (2004) 587–606. http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf

          Also, here is Fisher on the alternative hypothesis and type II error:

          “It was only when the relation between a test of significance and its corresponding null hypothesis was confused with an acceptance procedure that it seemed suitable to distinguish errors in which the hypothesis is rejected wrongly from errors in which it is “accepted wrongly” as the phrase does. The frequency of the first class, relative to the frequency with which the hypothesis is true, is calculable, and therefore controllable simply from the specification of the null hypothesis. The frequency of the second kind must depend not only on the frequency with which rival hypotheses are in fact true, but also greatly on how closely they resemble the null hypothesis. Such errors are therefore incalculable both in frequency and in magnitude merely from the specification of the null hypothesis, and would never have come into consideration in the theory only of tests of significance, had the logic of such tests not been confused with that of acceptance procedures.”
          Ronald Fisher. Journal of the Royal Statistical Society. Series B (Methodological). Vol. 17, No. 1 (1955), pp. 69-78. http://www.phil.vt.edu/dmayo/PhilStatistics/Triad/Fisher%201955.pdf

          2) I linked above to the Meehl (1990) paper where this is explained in (very great) detail; I also gave a real-life example of it in action with the NEJM paper. The problem arises from the many-to-one mapping of research hypotheses to a vague statistical alternative hypothesis, which is why I said this issue is “not actually a problem in the realm of statistics”.

          Your example does not include any research hypothesis, only two statistical hypotheses. This scenario does not correspond to any actual use case I can think of. In practice it usually goes like this:
          -A) If P [the research hypothesis H is true], then Q [the parameter (difference between means) will be greater than zero].
          -B) Q [the null hypothesis that the parameter is equal to zero is unlikely, and the parameter was measured to be positive; therefore the parameter is likely to be greater than zero].
          -C) Therefore, P [the research hypothesis H is true].

          We can even forget that we have uncertainty about the value of the parameter (ie Omniscient Jones told us the value) and change step B to “Q [the parameter is greater than zero].” As we see, considering this limiting case where our uncertainty approaches zero does not fix the problem.

          Note that if you observe ~Q [the parameter is not greater than zero] (where ~ means “not”), it is valid to deduce ~P [the research hypothesis H is not true] in this simplified description. In reality, though, P is never so simple. The theory is never tested alone; there are also auxiliary considerations A (eg no malfunctioning equipment, etc). So in practice ~P = (~H and/or ~A). In other words, the data or some other assumption can be wrong instead of, or in addition to, the research hypothesis.
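
          To see the many-to-one problem in action, here is a toy simulation (my own made-up setup, in Python): a real treatment effect and a pure measurement bias both produce the same “reject the nil null” outcome, so the rejection alone cannot tell you which P has been affirmed.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)
          n = 500  # per group

          # Research hypothesis 1: the treatment really shifts the outcome by 0.3 units.
          control_1   = rng.normal(0.0, 1.0, n)
          treatment_1 = rng.normal(0.3, 1.0, n)

          # Research hypothesis 2: no treatment effect at all, but the treated group
          # is measured with an instrument that reads 0.3 units too high.
          control_2   = rng.normal(0.0, 1.0, n)
          treatment_2 = rng.normal(0.0, 1.0, n) + 0.3

          for label, treated, control in [("real effect", treatment_1, control_1),
                                          ("measurement bias", treatment_2, control_2)]:
              t_stat, p_value = stats.ttest_ind(treated, control)
              print(label, "p =", round(p_value, 5))

          # Both scenarios (almost surely) reject H0: "difference = 0". The test result
          # alone cannot say which research hypothesis, if either, is the right one.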

          • Keith O'Rourke says:

            Thanks for reminding me of this quote of Gerd’s

            “Who is to blame for the null ritual?
            Always someone else. A smart graduate student told me that he did not want problems with his thesis advisor. When he finally got his Ph.D. and a post-doc, his concern was to get a real job. Soon he was an assistant professor at a respected university, but he still felt he could not afford statistical thinking because he needed to publish quickly to get tenure. The editors required the ritual, he apologized, but after tenure, everything would be different and he would be a free man. Years later, he found himself tenured, but still in the same environment. And he had been asked to teach a statistics course, featuring the null ritual. He did. As long as the editors of the major journals punish statistical thinking, he concluded, nothing will change.”

          • Brad Stiritz says:

            Thank you A.! I sincerely appreciate you breaking this down for me :) Thanks also for clarifying your previous posts and pointing out that I’m thinking in hybrid NHST terms.

            I didn’t want to get further into the textbook before getting all this conceptual groundwork sorted out. Thank you for your patience with me on that. Here is a worked example from Bock et al, 4e (big thanks to ABBYY OCR! ;). It appears shortly after the above-quoted statements about the null hypothesis:

            (begin quote)
            Step-by-step example: Testing a hypothesis

            Advances in medical care such as prenatal ultrasound examination now make it possible to determine a child’s sex early in a pregnancy. There is a fear that in some cultures some parents may use this technology to select the sex of their children. A study from Punjab, India (E. E. Booth, M. Verma, and R. S. Beri, “Fetal Sex Determination in Infants in Punjab, India: Correlations and Implications,” BMJ 309 [12 November 1994]: 1259-1261), reports that, in 1993, in one hospital, 56.9% of the 550 live births that year were boys. It’s a medical fact that male babies are slightly more common than female babies. The study’s authors report a baseline for this region of 51.7% male live births.

            Question: Is there evidence that the proportion of male births is different for this hospital?

            Hypotheses:
            The null hypothesis makes the claim of no difference from the baseline.

            The parameter of interest, p, is the proportion of male births:
            H0 : p = 0.517
            HA : p ≠ 0.517. The alternative hypothesis is two-sided.

            Model : Think about the assumptions and check the appropriate conditions. (content skipped for brevity)

            Mechanics : (content skipped for brevity)

            Conclusion : The P-value of 0.0146 says that if the true proportion of male babies were still at 51.7%, then an observed proportion as different as 56.9% male babies would occur at random only about 15 times in 1000. With a P-value this small, I reject H0. This is strong evidence that the proportion of boys is not equal to the baseline for the region. It appears that the proportion of boys may be larger.

            State your conclusion in context : That’s clearly significant, but don’t jump to other conclusions. We can’t be sure how this deviation came about. For instance, we don’t know whether this hospital is typical, or whether the time period studied was selected at random. And we certainly can’t conclude that ultrasound played any role.
            (end quote)
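
            (If anyone wants the skipped Mechanics step, a few lines of Python reproduce it, assuming the book uses the usual one-proportion z-test with the normal approximation:)

            from math import sqrt
            from scipy.stats import norm

            p0, p_hat, n = 0.517, 0.569, 550
            se = sqrt(p0 * (1 - p0) / n)           # standard error assuming H0 is true
            z = (p_hat - p0) / se                  # z is about 2.44
            p_value = 2 * norm.sf(abs(z))          # two-sided P-value
            print(round(z, 2), round(p_value, 4))  # about 2.44 and 0.015 (the book reports 0.0146)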

            Applying your H/P/Q logic, we would seem to have:

            H : The proportion of male live births may be significantly atypical at one particular hospital
            P : H is true
            Q : The appropriate significance-test P-value will be very low, indicating that the proportion for this hospital is unlikely to reflect the population value for all live births in the region.

            So the authors do infer from Q “strong evidence” that H is true. However, their H seems like a pretty conservative and “statistical” (to use your term) type of hypothesis, similar to what I proposed, which you said isn’t seen in practice. They also explicitly caution the student against inferring a broader “causal” H from Q.

            My vague memory is that at least some high-profile epidemiological research has concluded in similar fashion: “What causal H could be responsible for Q? Here are some ideas: H1, H2, H3, etc.” In words, the researchers are confident something very significant is going on, but don’t know why. Autism and bee colony collapse disorder are two headline news examples which may have this type of research behind them.

            • Keith O'Rourke says:

              > some high-profile epidemiological research
              If it is an epidemiological study, there is confounding and bias (or at least these can’t be ruled out), which implies that the distribution of p-values when the null is true is undefined; that makes a claim like “observed … as different as … would occur at random only about x times in 1000” nonsense.

              Surprising how many, even those doing teaching and research, appear to be unaware of this; see http://www.stat.columbia.edu/~gelman/research/published/GelmanORourkeBiostatistics.pdf

              Now, the book you are quoting does seem to be at least (implicitly) pointing that problem out.

              • Anoneuoid says:

                Keith, are you aware of anyone that has measured the distribution of p-values that result from samples taken from supposedly the same population under very controlled settings with randomization? I think that could provide a lower bound on the types of deviation from uniform we would expect.
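
                (For reference, the idealized benchmark is easy to simulate: under a true null with clean random sampling, the p-values should come out roughly uniform on [0, 1]. A quick sketch in Python, with arbitrary settings:)

                import numpy as np
                from scipy import stats

                rng = np.random.default_rng(1)
                pvals = []
                for _ in range(10_000):
                    # two samples drawn from literally the same population, no bias, no confounding
                    a = rng.normal(0.0, 1.0, 30)
                    b = rng.normal(0.0, 1.0, 30)
                    pvals.append(stats.ttest_ind(a, b).pvalue)

                # Under these ideal conditions the p-values come out roughly uniform on [0, 1];
                # each of the ten bins below should hold about 1000 of the 10000 p-values.
                print(np.histogram(pvals, bins=10, range=(0, 1))[0])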

              • Keith O'Rourke says:

                Anoneuoid:

                Fisher in the 1930’s with agricultural uniformity trials.

                Now, Peirce in 188? argued that randomization is what clearly justified inference from a random sample of a population.

                More recently, Efron for gene expression studies.

                Epidemiological studies are a different kettle of fish.

            • Anoneuoid says:

              Thanks for posting the example. Let’s start with this: “However, their H seems like a pretty conservative and “statistical” (to use your term) type of hypothesis, similar to what I proposed, which you said isn’t seen in practice.”

              1) First, I went and got the paper[1] and found the authors did not perform this analysis, so this is not an example of something found in practice. Second, I did not claim anything wasn’t “seen in practice”, rather that “scenario does not correspond to any actual use case I can think of”. Indeed, the textbook acknowledges there is no actual use for this test: “That’s clearly significant, but don’t jump to other conclusions.”

              Exactly! ***The extent of the appropriate conclusion is limited to a statement about the statistical significance.*** When properly interpreted, the outcome of a default-nil-null hypothesis significance test tells you nothing about the external reality or any theory of interest that may make useful predictions; the only “use” is to proceed to commit any of a number of logical fallacies.

              2) I also found that some of the information in this question appears to be incorrect. Specifically, the figure of 550 births in 1993 appears to be made up, and the baseline was from the same hospital ~10 years earlier, not “for the region”. So in terms of real-life applicability, this analysis is also an example of GIGO.

              3) Ignoring the above, they still do commit the affirming the consequent error we have been discussing. The problem is that reported/registered live births does not necessarily correspond to actual live births. So the actual sex ratio of births at this hospital could be exactly the same as the baseline, but male births are reported more often for some reason.

              As described in the question, there is a preference for male births in India. Do the hospitals have any incentive to report a greater percentage of male births? Perhaps more patients will choose to go there out of superstition because they hear male births are more common at that hospital, so the high sex ratio amounts to advertising. Or perhaps there are more male than female still-births, and still-births reflect badly on the hospital, so some of them get inaccurately recorded as live births. Maybe more unwanted female babies get abandoned at the hospital, which would lead to undesirable paperwork for the staff, so instead the staff just drop them off at the orphanage or find a foster home off the books. Etc, etc. The number of plausible reasons for the mere presence of a deviation from the null hypothesis is essentially limitless.

              4) With regards to: “In words, the researchers are confident something very significant is going on, but don’t know why. Autism and bee colony collapse disorder are two headline news examples which may have this type of research behind them.” The news coverage is based on the magnitude of the change, not mere statistical significance. Also, these are two areas where nothing has been figured out despite many years of research; instead, wild speculation, fraud, and conspiracy theories have proliferated. So I don’t understand why/how you want to interpret them as success cases for NHST.

              [1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2541782/

              • Brad Stiritz says:

                Anon> ***The extent of the appropriate conclusion is limited to a statement about the statistical significance***
                >the outcome of a default-nil-null hypothesis significance test tells you nothing about the external reality or any theory of interest that may make useful predictions
                > I don’t understand why/how you want to interpret them as success cases for NHST.

                Andrew> I think there are applications for which the concept of false positive and false negative, sensitivity and specificity, ROC curves, etc., are relevant. There are actual classification problems. I just don’t think scientific theories are typically best modeled in that way.

                Maybe this is really the crux of our difference, Anon.? You seem to be very theory-oriented, and you seem to be an insightful and creative thinker. Your discussion of male births in India was very interesting. I can understand your disdain at the lack of explanatory power in NHST.

                My own statistical goals are very modest. I appreciate having NHST at my disposal, to tell me that something seems to be commonplace, on whatever metric. Alternatively, testing an appropriate statistical hypothesis may signal that something seems extremely unusual, relative to appropriate expectations. In that case, I understand that it’s up to me to do my own thinking from there, with “possibly why?” questions and orthogonal deductive reasoning.

                Per Andrew’s comment, is it not true that our modern lives are vastly enhanced by an endless number of tests for statistical significance? The CO detector went off. The blood test was positive. Factory QA signaled the batch was out of tolerance. Etc etc. This is not pseudo-science.
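
                (For what it’s worth, those everyday cases are usually framed as classification problems in exactly the sense Andrew describes; here is a toy Python calculation with made-up sensitivity, specificity, and base-rate numbers showing how the pieces combine into what a “positive” actually means:)

                # Made-up numbers for a screening-style test, just to show how sensitivity,
                # specificity, and the base rate combine into what a "positive" means.
                sensitivity = 0.99   # P(test positive | condition present)
                specificity = 0.95   # P(test negative | condition absent)
                prevalence  = 0.01   # base rate of the condition

                p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
                ppv = sensitivity * prevalence / p_positive
                print(round(ppv, 3))  # about 0.167: at this base rate most positives are false alarms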

                You’ve definitely made your point, though, which I appreciate and acknowledge. Thank you again for the very helpful discussion :) I’ll close with the following quote:

                “Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): While it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis.”
                https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Criticism

              • Anoneuoid says:

                >”Per Andrew’s comment, is it not true that our modern lives are vastly enhanced by an endless number of tests for statistical significance? The CO detector went off. The blood test was positive. Factory QA signaled the batch was out of tolerance. Etc etc. This is not pseudo-science.”

                Please supply a reference you trust describing how one/some/all of these supposed use-cases is actually implemented. I am certain that upon investigation it will turn out to be some method other than NHST (as I defined it here multiple times) or there is no actual evidence of success. Also, note the pitfall of using the “evidence” produced by the NHST procedure to prove the usefulness of NHST.

  29. Tova Perlmutter says:

    Yesterday I heard Amy Cuddy and power poses being puffed by Tom Ashbrook on NPR: https://onpoint.wbur.org/2016/02/03/amy-cuddy-power-pose-business.

    I actually didn’t listen to more than a few minutes because every time she said “I’m a scientist, so…” I felt distressed.
