
p=.03, it’s gotta be true!

Howie Lempel writes:

Showing a white person a photo of Obama w/ artificially dark skin instead of artificially lightened skin before asking whether they support the Tea Party raises their probability of saying “yes” from 12% to 22%. 255 person Amazon Turk and Craigs List sample, p=.03.

Nothing too unusual about this one. But it’s particularly grating when hyper-educated liberal elites use shoddy research to decide that their political opponents only disagree with them because they’re racist.

Hey, they could have a whole series of this sort of experiment:

– Altering the orange hue of Donald Trump’s skin and seeing if it affects how much people trust the guy . . .

– Making Hillary Clinton fatter and seeing if that somehow makes her more likable . . .

– Putting glasses on Rick Perry to see if that affects perceptions of his intelligence . . .

– Altering the shape of Elizabeth Warren’s face to make her look even more like a Native American . . .

The possibilities are endless. And, given the low low cost of Mechanical Turk and Craigslist, surprisingly affordable. The pages of Psychological Science, PPNAS, and Frontiers in Psychology are wide open to you. As the man says, Never say no!

P.S. Just to be clear: I’m not saying that the above-linked conclusions are wrong or that such studies are inherently ridiculous. I just think you have to be careful about how seriously you take claims from reported p-values.


  1. Sam says:

    “Exposure to the light complexion condition significantly boosted the likelihood of voting for Obama when compared with both the dark condition alone (+4.6 points, p < .05) and with all the non-light conditions (+3.5 points, p < .05).”

    “The predicted probability of voting for Obama increased by 18 percentage points in the light condition among respondents with race IAT scores two standard deviations above the mean (high anti-black implicit bias), compared to the dark condition.”

  2. shravan says:

    Till about 2011 I would not have understood (despite having read Gelman and Hill) what was wrong with using p-values.

    It is gratifying, though, that being a prof at Stanford or Harvard doesn’t mean one is not clueless. Maybe someone can do an mturk study as to whether profs from brand-name unis question their own understanding more, or whether they are more “unskilled but unaware of it” than profs at non-brand-name unis.

    Andrew, a juicy bit you forgot to quote from the media report: “Their study, Willer said, is the first to demonstrate a causal link between Tea Party support and racial resentment.”

    • shravan says:

      One thing that bothers me about professors and other experts (including myself) is that just because they know a lot about one subject (their area of expertise), they assume they know everything about other topics of which they have only a flaky understanding. After the external validation of 150-309 papers published, a full professorship, perhaps lots of awards, and students who look to you for advice, it is so easy to start thinking you must be really someone great. Then someone points out a mistake. It’s almost impossible at this point to stop and ask: could I be wrong about this?

      The book Superforecasting provides some concrete advice against what I think of as the certainty mindset.

    • shravan says:

      It would have been interesting to have blacks and Hispanics as baselines to see if it’s really about the respondent being white. After all, women also supposedly judge other women as less competent than men (it’s not just men who do this).

      • Well, according to the Washington Post article, the results for nonwhite participants suggested a reverse effect (or none at all):

        “Among the 101 participants of other races or ethnicities, by contrast, those who saw the lightened image of Obama were twice as likely to support the tea party as those who saw the darkened image. Because they had fewer subjects of color, Willer and his colleagues couldn’t rule out the possibility that this difference between the randomly assorted groups was due to chance.”

        From the study itself: “Because our hypotheses concern white Americans’ responses to racial threats, our analyses in all studies focus on white participants (see Supplemental Material for analyses of minority respondents).” (I have not yet found a link to the supplemental material.)

        • Shravan says:

          The Supplemental Material is at the end of the paper.

          • Egad! Thank you.

            “In Study 1, two hundred and fifty-five participants identified as white (71.6%), 40 participants as Latino (11.2%), 23 as Asian (6.5%), 16 as black (4.5%), and 22 indicated another or mixed race (6.2%). Minorities showed less support for the Tea Party in the Dark Obama Prime condition (8%) as compared with the Light Obama Prime condition (19%), though this difference was not statistically significant (χ2 (1) = 2.72, p = .15).”

            • Shravan says:

              Is support a binary variable here (1 support, 0 no support)? So we have 6/78 supporting Obama in light condition and 15/78 in dark condition? Isn’t the support from minorities surprisingly low?

              • Support is binary here, but it’s support for the Tea Party, not support for Obama.

                In Study 5, participants were shown a list of Tea Party positions, one of which was “Strong opposition to President Barack Obama”–but actual support of Obama does not come up in any of the five studies, from what I can see.

              • Shravan says:

                So, am I missing something here? I would have fit the model as below, and under the p-value criterion my conclusion would differ from the authors’: there is less support for the Tea Party in the light condition. Surely an exciting result, since I squeezed in under the 0.05 threshold.


                > contrasts(condition)
                      light
                dark      0
                light     1

                ## sanity check


                > summary(glm(response~condition,family=binomial()))

                Call:
                glm(formula = response ~ condition, family = binomial())

                Deviance Residuals:
                    Min       1Q   Median       3Q      Max
                -0.6536  -0.6536  -0.4001  -0.4001   2.2649

                Coefficients:
                               Estimate Std. Error z value Pr(>|z|)
                (Intercept)     -1.4351     0.2873  -4.995 5.88e-07 ***
                conditionlight  -1.0498     0.5129  -2.047   0.0407 *
                ---
                Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                (Dispersion parameter for binomial family taken to be 1)

                Null deviance: 123.26 on 155 degrees of freedom
                Residual deviance: 118.68 on 154 degrees of freedom
                AIC: 122.68

                Number of Fisher Scoring iterations: 5
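A quick arithmetic check on the fit above (a sketch, assuming the counts quoted in the thread: 15/78 supporting in the dark condition and 6/78 in the light condition): for a binary outcome with one two-level predictor, the maximum-likelihood coefficients from glm() are just the cell log-odds, so the Estimate column can be reproduced by hand.

```r
## MLEs for a saturated two-group logistic regression are the cell log-odds.
## The counts below are assumptions taken from the discussion, not the paper.
dark_yes  <- 15; dark_n  <- 78   # dark condition: 15 of 78 support
light_yes <-  6; light_n <- 78   # light condition: 6 of 78 support

intercept <- log(dark_yes / (dark_n - dark_yes))                # logit(support | dark)
slope     <- log(light_yes / (light_n - light_yes)) - intercept # shift for light

round(c(intercept = intercept, slope = slope), 4)
## intercept     slope
##   -1.4351   -1.0498   (matches the Estimate column above)
```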

              • Andrew says:


                Ugh! Please use display(), never summary()!

              • shravan says:

                Andrew, been years since I used arm!

              • Andrew says:


                Just fit using stan_glm() then, and the summary is what you’ll want to see, I think.

              • Shravan says:

                Andrew: here you go:



                                Median MAD_SD
                (Intercept)     -1.5   0.3
                conditionlight  -1.1   0.5

            • Shravan says:

              I guess I should have also converted to probabilities:

              > muhat <- -1.5 + (-1.1)   # Intercept + Slope: light condition
              > exp(muhat)/(1+exp(muhat))
              [1] 0.06913842
              > muhat0 <- -1.5           # Intercept: dark condition
              > exp(muhat0)/(1+exp(muhat0))
              [1] 0.1824255

              • TPer says:

                Less support for the Tea Party in the light Obama prime condition, p<.05; how is that a different conclusion from the authors?

              • Shravan says:

                Please correct me if I misunderstood or made some other mistakes here, but Diana quoted the paper as saying: “Minorities showed less support for the Tea Party in the Dark Obama Prime condition (8%) as compared with the Light Obama Prime condition (19%), though this difference was not statistically significant (χ2 (1) = 2.72, p = .15).” I was looking at that. Assuming my numbers are not messed up (quite likely as I am not in top form these days), I was only illustrating the garden of forking paths by making it come out significant.

  3. shravan says:

    I have a lot of time on my hands today!

    Andrew, when you say, make Clinton fatter, does this mean you think she is fat? US obesity stats would not classify her as fat.

  4. Noname says:

    Hi, sorry for the dumb question, but could you perhaps elaborate on why those findings are not correct or shouldn’t be taken as seriously as they are (at least that’s the impression I got from the article, but I am not that good with this p-value thing)? I was thinking maybe the sample is not representative, or there are not enough people to generalize the results? Could reverse causality be a problem as well? Just guessing. Anyway, thanks for your answer.

    • Shravan says:

      I guess this is a question for Andrew. But in case it isn’t, it’s not so much that the p-values aren’t correct or shouldn’t be taken seriously. They do answer a question. But they answer the wrong question. Longer rant here and two articles with code here.

  5. Shravan says:

    Wow, this paper has even more embarrassing mistakes. Since it’s a draft I hope Willer is reading this blog and will fix the errors before it is published.

    BTW, it would not hurt the reader at all if a draft paper had page numbers.

    The observed power for these items was .465, .492, and .170, respectively, indicating that our sample size was insufficient to consistently find a significant effect of experimental manipulation on these measures.

    The paper computes observed power. Once you know the p-value, observed power has nothing new to offer. I have seen psychologists compute observed power too; I guess nobody got the memo that the two quantities are deterministically related. See Hoenig and Heisey on The Abuse of Power.
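Hoenig and Heisey’s point can be made concrete: for a two-sided z-test, “observed power” (power evaluated at the observed effect size) is a deterministic function of the p-value alone. A minimal sketch of that relationship (my own illustration, not a calculation from the paper):

```r
## Observed power for a two-sided z-test at level alpha, computed from p alone:
## once p fixes the implied |z|, the "power" evaluated at that |z| is fixed too.
observed_power <- function(p, alpha = 0.05) {
  z     <- qnorm(1 - p / 2)      # |z| implied by the two-sided p-value
  zcrit <- qnorm(1 - alpha / 2)  # critical value, e.g. 1.96
  pnorm(z - zcrit) + pnorm(-z - zcrit)
}

observed_power(0.05)    # about 0.50: p at the threshold always implies ~50% power
observed_power(0.0407)  # about 0.53, for the glm fit earlier in the thread
```

So reporting observed power alongside p, as the draft does, restates the p-value rather than adding an independent assessment of the design.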

    • Shravan says:

      To their credit, they did several experiments. But their models are not appropriate for the dependent measure, which is a Likert scale.

      Study 3: White respondents assigned to the Income Gap Closing condition reported greater support for the Tea Party (M = 1.45) than did those participants assigned to the Income Gap Expanding condition (M = 1.23, t(215) = 2.10, p = .037, d = .29).

      We are talking about a difference of 1.45-1.23=.22 here, where the dependent measure is a 7-point Likert scale. People think that model assumptions just don’t matter, but they do. You will repeatedly find that violating model assumptions can give you distorted conclusions, e.g., significance where none exists. Maybe Willer needs to read Kruschke’s book on how to do this right. Actually, R also allows for models more appropriate for Likert scales than t-tests.
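On that last point, a cumulative-logit (proportional-odds) model is one standard way in R to respect the ordinal nature of a Likert response, e.g. via MASS::polr. A sketch on simulated data (the paper’s data aren’t available, so the condition labels and effect size below are invented):

```r
## Ordinal regression for a 7-point Likert outcome; data are simulated.
library(MASS)

set.seed(1)
n <- 217  # roughly the degrees of freedom reported for Study 3
condition <- factor(rep(c("closing", "expanding"), length.out = n))
## Latent-variable simulation: a small invented shift for the "closing" group.
latent <- rnorm(n, mean = ifelse(condition == "closing", 0.3, 0))
rating <- cut(latent, breaks = c(-Inf, -1.2, -0.6, -0.1, 0.4, 0.9, 1.4, Inf),
              labels = 1:7, ordered_result = TRUE)

## The model treats the response as ordered categories, not as interval data.
fit <- polr(rating ~ condition, method = "logistic", Hess = TRUE)
summary(fit)  # condition effect on the log-odds scale, plus six cutpoints
```

Unlike the t-test, this model makes no pretense that the distance between “1” and “2” on the scale equals the distance between “6” and “7”.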

      This problem runs throughout the paper. All the experiments have this issue. In Study 4 the difference is 0.14. Study 5, 0.26. And so on.

      This is a very general problem in the humanities and social sciences. People just plug in the data, ignore the model assumptions, and look for the p-value.

      Question: Is there a statistics department at Stanford? What about that political scientist, Jackman? In Willer’s place, I would go to the statisticians, or hook up with Jackman and learn from him how to analyze data.

  6. Chris Pounds says:

    Looks like at least one MD in the Alzheimer’s space is also preaching against p-values. See Lon Schneider’s post:

    …p values are probability statements about the randomness of the distributions of the outcomes, but are not measures of magnitude of effect. A less impressionistic and more nuanced approach to interpreting the clinical significance of solanezumab is to examine the effect sizes of the outcomes…

    • Martha (Smith) says:

      Note that Schneider ends his comment (at the very bottom of the linked page) with

      “It will be difficult to judge whether or not there is a clinically meaningful effect or, indeed, whether any compelling solanezumab-responsive subgroups emerge from post hoc analyses, as these subgroups likely will be identified on the basis of low p values and small effects. “
