Alex Gamma sends along a recently published article by Carola Salvi, Irene Cristofori, Jordan Grafman, and Mark Beeman, along with the note:

This might be of interest to you, since it’s political science and smells bad.

From The Quarterly Journal of Experimental Psychology: Two groups of 22 college students each, identified as conservatives or liberals based on two 7-point Likert scales, were asked whether they solved a word association task by insight or by analysis. There was a statistically significant group × solving-strategy interaction in an ANOVA (Fig. 2), and the findings are described as “providing novel evidence that political orientation is associated with problem-solving strategy”. This clearly warrants the paper’s title, “The politics of insight.”

I replied: N=44, huh?

To which Gamma wrote:

About the N=44, the authors point out that they matched students pairwise on their scores in the two Likert tasks, but I’m not sure that makes their conclusions more trustworthy. From the paper:

“Our final sample consisted of 22 conservatives who were matched with 22 liberal participants. For example, each participant who scored 7 on the conservatism scale and 1 on the liberalism scale was matched (on age and ethnicity) with another participant who scored 7 on the liberalism scale and 1 on the conservatism scale. Each participant who scored 7 on the conservatism scale and 2 on the liberalism scale was matched (on age and ethnicity) with another participant who scored 7 on the liberalism scale and 2 on the conservatism scale and so on. The final sample of 44 participants was balanced for political orientation and ethnicity.”

I see no reason to believe these results. That is, in a new study I have no particular expectation that these results would replicate. They *could* replicate—it’s possible—I just don’t find this evidence to be particularly strong.

From their Results section:

As I typically say when considering this sort of study, I think the researchers would be better off looking at all their results rather than sifting based on statistical significance.

**Am I being too hard on these people?**

But wait! you say. Isn’t almost every study plagued by forking paths? And I say, sure, that’s why I don’t take a p-value as evidence here. If you have some good evidence, fine. If all you have is a p-value and a connection to a vague theory, then no, I see no reason to take it seriously.

And, just to say this one more time, I’m *not* recommending a preregistered replication here. If they want to, fine, but it would seem to me to be a waste of time.

Can someone tell me what (from the above excerpt) “planned pairwise comparisons” means? I am wondering if they conducted t tests based on matched samples vs unmatched. The “matching” here appears to be only on the strength of their conservative or liberal views, so I would not view these as matched in any meaningful sense (I am generally skeptical of matched samples in any case) – if that is what they did.

None of this is meant to distract from the more serious concerns that Andrew raises. But I am wondering just how far the problems with their analysis go.

I wondered about that too. I think the ‘paired’ part in the third sentence pertains to solution strategies (analytical vs. insight) within subjects. I think.

Like Garnett, I think the ‘paired’ part refers to their matching of liberal and conservative subjects. The ‘planned’ part refers to if you knew before collecting the data/running the analysis that you were going to make that comparison, versus an unplanned or post-hoc comparison (https://en.wikipedia.org/wiki/Post_hoc_analysis).

In general, planned comparisons use a less stringent correction, and so it’s easier to get a statistically significant result with a planned vs post-hoc correction method. This is another place where it’s easy to fudge your results a bit, by claiming you planned to make a comparison when you actually didn’t until after you had started looking at results.
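To make the planned-versus-post-hoc difference concrete, here is a minimal sketch with made-up p-values (the numbers are purely illustrative, not from the paper), using a simple Bonferroni correction:

```python
# Made-up p-values for k pairwise comparisons (purely illustrative).
raw_p = [0.030, 0.200, 0.450]
alpha = 0.05

# A single planned comparison is tested at the nominal alpha:
planned_sig = raw_p[0] < alpha  # 0.030 clears 0.05

# An unplanned (post-hoc) comparison among k tests is often
# Bonferroni-corrected: each p-value must beat alpha / k instead.
k = len(raw_p)
posthoc_sig = [p < alpha / k for p in raw_p]  # 0.030 does not clear 0.05/3

print(planned_sig, posthoc_sig)
```

So the same comparison that is “significant” when declared planned fails the corrected threshold when treated as post-hoc, which is exactly where the fudging opportunity lies.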

Supposing I actually care about this particular problem, I’m interested in how one might use these results to establish a prior for a subsequent study. The second-to-last sentence identifies a political-orientation effect of a 7.7-point difference, with a standard error of about 3.1.

Someone who would be very surprised by their result might assign a prior centered at zero with SE=4.7, so that they are about 95% certain that the contrast is less than what the authors observed.

I guess that we could take these results at face value and use a hierarchical prior, centering the effect at 7.7 but allowing for extra ‘study-specific’ variability, so that the SE of the effect is sqrt(2z^2 + 3.1^2), and z either has to be stipulated or given an informative prior.

Another prior, suggested by the garden of forking paths, could retain the observed SE by identifying the reported effect as biased large, so that the true effect has a prior centered at 7.7 − B, with an SE of 4.7.

Or you could ignore this study altogether, but that doesn’t seem right. Does it?
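For what it’s worth, combining the skeptical zero-centered prior mentioned above with the reported estimate is just a conjugate normal-normal update. A sketch (the function name is mine; the inputs are the 7.7 estimate, its SE of about 3.1, and the prior with SE 4.7):

```python
import math

def posterior_normal(prior_mean, prior_sd, est, se):
    """Conjugate normal-normal update from summary statistics:
    precision-weighted average of prior mean and estimate."""
    w_prior = 1 / prior_sd**2
    w_data = 1 / se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * est)
    return post_mean, math.sqrt(post_var)

# Skeptical prior centered at zero with SE 4.7, combined with
# the reported estimate of 7.7 (SE about 3.1):
mean, sd = posterior_normal(0.0, 4.7, 7.7, 3.1)
print(mean, sd)  # roughly 5.4 and 2.6
```

So even the skeptic who starts at zero ends up with a posterior centered well away from zero, which is one way of seeing how much work the prior’s width is doing.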

…and is there some specific “population” being somehow sampled here? Do they say?

Liberals & Conservatives are mentioned — is that all liberals/conservatives, American lib/consv, college age lib/consv, East Coast lib/consv, psychology majors, white/other, male/female, etc etc.

Liberal/Conservative are inherently vague terms and even tougher to isolate objectively in an undefined population. How do you rationally draw any broad conclusions or associations from such a tiny, casual experiment?

James:

In answer to the question posed in the last sentence of your comment: You can rationally draw the broad conclusion that this is the sort of thing that gets published in psychology journals.

I think student samples can be useful for many types of experiments, but I am especially skeptical of the assumption that the self-reported political beliefs of college students resemble those of any broader public. Political socialization research isn’t en vogue, but it strongly suggests (as will anyone who has interacted with college students in these contexts) that these are exactly the years when people tend to have total mush for brains about things like this.

If there are inherent differences between liberals and conservatives, they would still turn up, of course. But in such a small sample, I have my doubts. And with the student sample, if we assume the labels are less informative than for the average adult, differences might be explained by other things that correlate with coming to college with a particular ideological affinity.

What would constitute “some good evidence” if not ANOVA?

Also, what’s wrong with N=44? If it had been 400 or 4000, what would you have said? If I saw 44 people with the same property I’d be like “man, looks like there’s a generalizable property.” No?

Lauren:

It’s not 44 people with the same property. It’s 44 people + the data were tortured = statistical significance. When N is tiny, data are noisy, and effects are small, type M and type S errors will be huge; see my recent paper with Carlin for discussion.
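The type M / type S point is easy to illustrate with a quick simulation in the spirit of the Gelman and Carlin design calculations. The “true” effect of 2 points below is hypothetical, and 3.1 is the standard error reported in the comment thread above; the function name is mine:

```python
import random
import statistics

def retrodesign_sim(true_effect, se, n_sims=100_000, z_crit=1.96, seed=1):
    """Monte Carlo sketch of type S / type M error rates:
    simulate estimates, keep the statistically significant ones,
    and ask how wrong-signed and how exaggerated they are."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)
        if abs(est) > z_crit * se:
            significant.append(est)
    power = len(significant) / n_sims
    type_s = sum(e * true_effect < 0 for e in significant) / len(significant)
    type_m = statistics.mean(abs(e) for e in significant) / abs(true_effect)
    return power, type_s, type_m

power, type_s, type_m = retrodesign_sim(2.0, 3.1)
print(power, type_s, type_m)
```

Under these (hypothetical) numbers, power is around 10%, a few percent of significant results have the wrong sign, and significant estimates overstate the true effect several-fold: the “statistical significance filter” in action.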

That paragraph with the effect sizes and standard deviations hurts to read. Why not just show a bunch of scatter plots of your data and a table?

In the authors’ defense, it could be a journal/editorial requirement to report things that way. I’ve definitely seen such requirements, even when I wasn’t keen to report that many values.

I applied the GRIM test (shameless plug for preprint here: https://peerj.com/preprints/2064/) to this article.

1. “We divided our sample into three groups: Conservatives were defined as participants who scored above 4 on the question of conservative political ideology (16.7%, N = 22). Liberals were defined as participants who scored above 4 on the question of liberal political ideology (59.8%, N = 79). Participants who scored the same number (e.g., 4 neutral) on the questions of conservative and liberal ideology were excluded from the analysis (21.2%, N = 28).”

None of those percentages are consistent with their initial pool size of 129. The three subgroup Ns sum to 129, so I presume that was correctly reported.

2. In Table 2, each participant was ranked on conservatism (1-7) and liberalism (1-7). Since they were exactly matched, the C/L figures for C participants are swapped for L participants. The scores are 5.77 and 2.31. But while 5.77 can be a correctly rounded score when an integer sum is divided by 22, 2.31 cannot: 2.31 × 22 = 50.82, and 51/22 = 2.3181818…, which rounds to 2.32.

So no, I don’t believe these numbers either. At a minimum, the authors have made four unique reporting errors before they got halfway through their Method section.
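The GRIM logic described above is simple enough to sketch: with integer item scores, the sum is an integer, so a reported mean must be some integer divided by N, rounded. A minimal check (the function name is mine; the ±1 neighbor check just guards against floating-point wobble):

```python
def grim_consistent(mean, n, decimals=2):
    """GRIM test: can any integer total of n integer scores
    round to the reported mean?"""
    total = round(mean * n)
    # Check nearby integer totals in case of floating-point wobble.
    return any(round(t / n, decimals) == mean
               for t in (total - 1, total, total + 1))

# 5.77 is attainable with n = 22 (127/22 = 5.7727... -> 5.77),
# but 2.31 is not (50/22 -> 2.27; 51/22 -> 2.32).
print(grim_consistent(5.77, 22), grim_consistent(2.31, 22))
```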

Nick:

Interesting, what do you think happened? Perhaps they rounded 2.31818 to 2.31? That would be a natural mistake to make, and they might feel: who cares anyway?

Rounding error can be a big deal in practice, though, as it’s one more researcher degrees of freedom that can be used to manufacture “p less than .05,” thus statistical significance, publication, Ted talks, NPR appearances, etc.

Someone who rounds 2.31818 to 2.31 should not be publishing in a scientific journal, and if their attitude is “who cares anyway” then they should be handing back their PhDs. They can’t even get the freaking percentages of participants in each category correct. Why should we believe they can calculate an ANOVA? (Did you notice that the corresponding author’s e-mail address contains a typo? How much quality control went into this article?)

In our investigation of the datasets that the authors provided us for the GRIM project, pretty much the only genuine rounding errors we found were due to an annoying property of SPSS, which tends to round to 3 figures. If the number is 2.444715, which should be correctly rounded to 2.44 (to 2dp), SPSS will round to 2.445 (to 3dp) and the researcher may round that up, which is visually correct except that of course the underlying number might be anywhere between 2.444501 and 2.44549.
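That double-rounding trap is easy to reproduce with Python’s decimal module (ROUND_HALF_UP mimics the “round the 5 up” habit; the number is the one from the SPSS example):

```python
from decimal import Decimal, ROUND_HALF_UP

x = Decimal("2.444715")

# Rounding once, straight to 2 decimal places:
once = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)    # 2.44

# Rounding via an intermediate 3-decimal display (the SPSS-style trap):
via_3dp = x.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)  # 2.445
twice = via_3dp.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)  # 2.45

print(once, twice)
```

Same underlying number, two different 2dp results, depending only on whether you round through the displayed intermediate value.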

Most of the genuine errors we found were due either to screwups, or to unreported exclusions. Here, the authors have apparently taken great care to report the results for exactly 22 participants in each condition on a one-item scale. I don’t see any room for error.

As your correspondent who sent you the article says, this smells bad. I would like to see the dataset here.

I think they must have excluded the rest for having missing values. But I think this whole matching procedure is really strange; usually you would match on variables that make the samples similar, not ones that make them different.

I just checked the numbers. For the value of 5.77 (the mean conservatism/liberalism score that was arithmetically possible, unlike 2.31), the SD of 0.68 is not consistent. That is, there is no combination of 1-7 scores that produces a mean that rounds to 5.77 and an SD that rounds to 0.68. The nearest is M=5.7727 SD=0.6853, with the SD rounding to 0.69 to 2dp.

That’s five (seven including duplicates) reporting errors so far.
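The mean-plus-SD check can be automated in the same spirit as GRIM. This sketch (function name mine) tests only a necessary condition: integer item scores force both the sum and the sum of squares to be integers, and it does not verify that a given sum of squares is actually attainable with scores of 1–7:

```python
def sd_consistent(mean, sd, n, decimals=2):
    """GRIMMER-style necessary condition: does any pair of integer
    (sum, sum of squares) reproduce the rounded mean and sample SD?"""
    # Integer sums consistent with the rounded mean (scores 1..7):
    ok_sums = [s for s in range(n, 7 * n + 1)
               if round(s / n, decimals) == mean]
    for s in ok_sums:
        # Sample variance from integer sum of squares q:
        for q in range(s, 49 * n + 1):
            var = (q - s * s / n) / (n - 1)
            if var >= 0 and round(var ** 0.5, decimals) == sd:
                return True
    return False

# M = 5.77 with SD = 0.68 has no integer solution for n = 22,
# whereas SD = 0.69 does (the nearest attainable value).
print(sd_consistent(5.77, 0.68, 22), sd_consistent(5.77, 0.69, 22))
```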

I’m coming to this very late, but: The reported percentages are what you get if the reported subsample sizes are divided by 132 instead of 129. So perhaps they actually started with 132 people and three of them … disappeared at some early stage?

So they calculated the percentages early on, and then when three participants were dropped, reported the new per-condition numbers but couldn’t be bothered to update the percentages? This strikes me as very sloppy.

Yup, very sloppy. (I was attempting to explain, not to defend.)
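The 132-versus-129 arithmetic above is trivial to verify:

```python
# Reported subgroup sizes from the paper's Method section.
counts = {"conservatives": 22, "liberals": 79, "excluded": 28}

# Percentages under the two candidate pool sizes:
for pool in (129, 132):
    pcts = {k: round(100 * n / pool, 1) for k, n in counts.items()}
    print(pool, pcts)

# Only pool = 132 reproduces the reported 16.7%, 59.8%, and 21.2%;
# dividing by 129 gives 17.1%, 61.2%, and 21.7%.
```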

Problems solved “with insight”: 29% ± 12% compared to 21% ± 9%… Beyond questioning the appropriateness of p-values here, I question their math. If my null hypothesis is that both numbers are 25%, I can’t reject it with p < 0.05 like they say.

Well, the t test (M=28.7, SD=11.7, N=22 vs M=21, SD=8.9, N=22) does give the numbers they claim…
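For anyone who wants to reproduce that check from the summary statistics, here is a rough sketch treating the two groups as independent with pooled variance (the paper’s df of 21 suggests a paired test, for which one would need the per-pair differences; this is just the back-of-the-envelope version):

```python
import math

def t_pooled(m1, sd1, n1, m2, sd2, n2):
    """Two-sample t statistic with pooled variance,
    computed from summary statistics only."""
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

# Summary statistics quoted in the comment above:
t = t_pooled(28.7, 11.7, 22, 21.0, 8.9, 22)
print(round(t, 2))  # about 2.46, close to the reported 2.45
```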

What is the meaning of the CIs on the effect sizes?

“… this difference was not reliable, F(2, 42) = 0.93; MSE = 0.011, p > .250, η2 = .022 (95% CI [3.6, 9.8]).”

“… liberals solved more problems with insight (28.7%, SD = ±11.7%) than did conservatives (21%, SD = ±8.9%), t(21) = −2.45, p < .05; d = 0.74 (95% CI [−14.3, −11.8]).”

D’oh! I was reading those standard deviations as standard errors.