
Hey—here’s a tip from the biology literature: If your correlation is .02, try binning your data to get a correlation of .8 or .9!

Josh Cherry writes:

This isn’t in the social sciences, but it’s an egregious example of statistical malpractice:

Below the abstract you can find my [Cherry’s] comment on the problem, which was submitted as a letter to the journal, but rejected on the grounds that the issue does not affect the main conclusions of the article (sheesh!). These folks had variables with Spearman correlations ranging from 0.02 to 0.07, but they report “strong correlations” (0.74-0.96) that they obtain by binning and averaging, essentially averaging away unexplained variance. This sort of thing has been done in other fields as well.

The paper in question, by A. Diament, R. Y. Pinter, and T. Tuller, is called, “Three-dimensional eukaryotic genomic organization is strongly correlated with codon usage expression and function.” I don’t know from eukaryotic genomic organization, nor have I ever heard of “codon”—maybe I’m using the stuff all the time without realizing it!—but I have heard of “strongly correlated.” Actually, in the abstract of the paper it gets upgraded to “very strongly correlated.”

In the months since Cherry sent this to me, more comments have appeared at the above PubMed Commons link, including this one by Joshua Plotkin, which shares the referee report he wrote with Premal Shah in 2014 recommending rejection of the paper. Key comment:

Every single correlation reported in the paper is based on binned data. Although it is sometimes appropriate to bin the data for visualization purposes, it is entirely without merit to report correlation coefficients (and associated p-values) on binned data . . . Based on their own figures 3D and S2A, it seems clear that their results either have very small effect or do not hold at all when analyzing the actual raw data.

And this:

Moreover, the correlation coefficients reported in most of their plots make no sense whatsoever. For instance, in Fig1B, the best-fit regression line of CUBS vs PPI barely passes through the bulk of the data, and yet the authors report a perfect correlation of R=1.

A follow-up comment by Plotkin has some numbers:

In the paper by Diament et al 2014, the authors never reported the actual correlation (r = 0.022) between two genomic measurements; instead they reported correlations on binned data (r = 0.86).

I think we can all agree that .022 is a low correlation and .86 is a high correlation.
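It's easy to see how the trick works. Here's a simulation (made-up data, not theirs) in which a million points with a true correlation of about 0.02 turn into a binned "correlation" above 0.9 once you average away the unexplained variance:

```python
# Sketch with simulated data: binning and averaging inflates a near-zero
# correlation into a "strong" one.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)   # true correlation roughly 0.02

raw_r = np.corrcoef(x, y)[0, 1]

# Bin by quantiles of x and average both variables within each bin.
n_bins = 50
order = np.argsort(x)
x_bins = x[order].reshape(n_bins, -1).mean(axis=1)
y_bins = y[order].reshape(n_bins, -1).mean(axis=1)
binned_r = np.corrcoef(x_bins, y_bins)[0, 1]

print(raw_r)     # ~0.02
print(binned_r)  # ~0.9: the noise has been averaged away, not explained
```

With 20,000 points per bin, the noise in each bin mean shrinks by a factor of about 140, so even a tiny trend dominates the bin-to-bin variation.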

But then there’s this from Tuller:

In Shah P, 2013 Plotkin & Shah report in the abstract a correlation which is in fact very weak (according to their definitions here), r = 0.12, without controlling for relevant additional fundamental variables, and include a figure of binned values related to this correlation. This correlation (0.12) is reported in their study as “a strong positive correlation”.

So now I’m thinking that everyone in this field should just stop calling correlations high or low or strong or weak. Better just to report the damn number.

Tuller also writes:

If the number of points in a typical systems biology study is ~300, the number of points analyzed in our study is 1,230,000-fold higher (!); a priori, a researcher with some minimal experience in the field should not expect to see similar levels of correlations in the two cases. Everyone also knows that increasing the number of points, specifically when dealing with non trivial NGS data, also tends to very significantly decrease the correlation.

Huh? I have no idea what they’re talking about here.

But, in all seriousness, it sounds to me like all these researchers should stop talking about correlation. If you have a measure that gets weaker and weaker as your sample size increases, that doesn’t seem like good science to me! I’m glad that Cherry put in the effort to fight this one.

Comment on network analysis of online ISIS activity

Two different journalists asked me about this paper, “New online ecology of adversarial aggregates: ISIS and beyond,” by N. F. Johnson, M. Zheng, Y. Vorobyeva, A. Gabriel, H. Qi, N. Velasquez, P. Manrique, D. Johnson, E. Restrepo, C. Song, and S. Wuchty, which begins:

Support for an extremist entity such as Islamic State (ISIS) somehow manages to survive globally online despite considerable external pressure and may ultimately inspire acts by individuals having no history of extremism, membership in a terrorist faction, or direct links to leadership. Examining longitudinal records of online activity, we uncovered an ecology evolving on a daily time scale that drives online support, and we provide a mathematical theory that describes it. The ecology features self-organized aggregates (ad hoc groups formed via linkage to a Facebook page or analog) that proliferate preceding the onset of recent real-world campaigns and adopt novel adaptive mechanisms to enhance their survival. One of the predictions is that development of large, potentially potent pro-ISIS aggregates can be thwarted by targeting smaller ones.

I sent my response to the journalists, but then I thought that some of you might be interested too, so here it is:

The paper seems kinda weird. Figure 1 has 10 groups, but you have to contact the authors to find out the names of the groups? They say, “Because the focus in this paper is on the ecosystem rather than the behavior of any individual aggregate, the names are not being released.” But (a) there’s room on the graph for the names, and (b) it would be easy to post the names online. It creeps me out: maybe the FBI is tracking who emails them for the names? I have no idea but it seems strange to withhold data and make readers ask them for it. If the data were actually secret for national security reasons, that I’d understand. But if you’re going to release the data to anyone who asks, why not just post it online?

Anyway, that all put me in a bad mood. Also, the little image inserted in figure 1 adds zero information, except to show that the authors had access to a computer program that makes these umbrella-like plots.

Beyond this, they talk about a model for the shark-fin shape, but this just seems a natural consequence of the networks being shut down as they get larger and more noticeable.

On the plus side, the topic is obviously important, the idea of looking at aggregates seems like a good one, and I’m sure much can be learned from these data. I think it would be more useful for them to have produced a longer, open-ended report full of findings. The problem with the “Science magazine” style of publishing is that it encourages researchers to write these very short papers that are essentially self-advertisements. I guess this paper might be useful in the sense that it could attract media attention, and maybe the authors have a longer report with more data explorations. Or it might be that there’s useful stuff in this Science paper and I’m just missing it because I’m getting lost in their model. My guess is that the most valuable things here are the descriptive statistics. If so, that would be fine. There’s a bit of a conflict here in that for Science magazine you’re supposed to have discoveries, but for fighting ISIS there’s more of a goal of understanding what is happening out there. In theory there is some benefit from modeling (as the authors note, one can simulate various potential anti-ISIS strategies), but I don’t think they’re really there yet.

I’m guessing Cosma Shalizi and Peter Dodds would have something to say here, as they are more expert than I am in this sort of physics-inspired network analysis.

P.S. Here’s one of the news articles. It’s by Catherine Caruso and entitled, “Can a Social-Media Algorithm Predict a Terror Attack?”, which confused me because I didn’t notice anything about prediction in that research article. I mean, sure, they said, “Our theoretical model generates various mathematically rigorous yet operationally relevant predictions.” But I don’t think they were actually predicting terror attacks. But maybe I’m missing something.

P.P.S. The New York Times also ran a story on this one. They didn’t interview me; instead they interviewed a couple of actual experts, both of whom expressed skepticism. Even so, the writer of the article, Pam Belluck, managed to spin the research project in a positive way. I think that’s just the way things go with science reporting. If it’s not a scandal of some sort, the press likes the idea of the scientist as hero.



Paul Alper pointed me to this news article with the delightful url, “superstar-doctor-fired-from-swedish-institute-over-research-lies-allegations-windpipe-surgery.” Also here. It reminded me of this discussion from last year.

Damn, those windpipe surgeons are the worst. Never should trust them. The pope should never have agreed to officiate at this guy’s wedding.

You’ll never guess what I’ll say about this paper claiming election fraud! (OK, actually probably you can guess.)

Glenn Chisholm writes:

As a frequent visitor of your blog (a bit of a long-time-listener, first-time-caller comment, I know) I saw this particular controversy:


Very superficial analysis:

and was interested in whether I could get you to blog on its actual statistical foundations; this particular paper has at least the appearance of respectability due to the institutions involved. As a permanent resident, not a citizen, I tend to be a little more abstracted from American politics, although I do enjoy the vigor with which it is played in the US. Based on some of the vitriol I have seen from all sides you may not want to touch this with a ten-foot pole, but I thought it would be interesting to get a political scientist with credibility to put an analysis out there publicly. There does seem to be this undercurrent in this cycle where a group genuinely feels that somehow they were disenfranchised. My take is that Occam’s razor applies here and no manipulation occurred, but my opinion is worthless.

My reply:

I don’t find this paper at all convincing. There can be all sorts of differences between different states, and that pie chart is a joke in that it hides the between-state variation within each group of states. You never know, fraud could always happen, but this is essentially zero evidence. Not that you’d need an explanation as to why a 74-year-old socialist fails to win a major-party nomination in the United States.

Also, regarding your comment about the institutions involved, I wouldn’t take the credentials so seriously; this Stanford guy is not a political scientist. Not that this means it’s necessarily wrong, but he’s not coming from a position of expertise. He’s just a guy; the Stanford affiliation gives him no special credibility.

Objects of the class “Pauline Kael”

A woman who’s arguably the top person ever in a male-dominated field.

Steve Sailer introduced the category and entered Pauline Kael (top film critic) as its inaugural member. I followed up with Alice Waters (top chef/restaurateur), Mata Hari (top spy), Agatha Christie (top mystery writer), and Helen Keller (top person who overcame a disability; sorry, Stevie Wonder, you didn’t make the cut).

We can distinguish this from objects of the class “Amelia Earhart”: a woman who’s famous for being a woman in a male-dominated field. There are lots of examples of this category, for example Danica Patrick.

Objects of the class “Objects of the class”

P.S. Some other good ones from that thread:

Queen Elizabeth I (tops in the “English monarchs” category)

Margaret Mead (anthropologists)

We also discussed some other candidates, including Marie Curie, Margaret Thatcher, Mary Baker Eddy, Ellen Willis, and my personal favorite on this list, Caitlyn Jenner (best decathlete of all time).

“Smaller Share of Women Ages 65 and Older Are Living Alone,” before and after age adjustment

After noticing this from a recent Pew Research report:


Ben Hanowell wrote:

This made me [Hanowell] think of your critique of Case and Deaton’s finding about non-Hispanic mortality.

I wonder how much these results are driven by the fact that the population of adults aged 65 and older has gotten older with increasing lifespans, etc etc.

My collaborator Jonathan Auerbach looked into this and found that, in this case, age adjustment doesn’t make a qualitative difference. Here are some graphs:

Percent of adults over age 65 who live alone, over time, for women and men, with solid lines showing raw data and dotted lines showing populations adjusted to the age x sex composition from 2014:


The adjustment doesn’t change much. To get more understanding, let’s break things up by individual ages. Here are the raw data, with each subgraph showing the curves for the 5 years in its age category:


Some interesting patterns here. At the beginning of the last century, a bit less than 10% of elders were living alone, with that percentage not varying by sex or age. Then big changes in recent years.

We learn a lot more from these individual-age curves than we did from the simple aggregate.

In this case, age adjustment did not do much, but age disaggregation was useful.
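For readers who haven't seen the mechanics, the age adjustment above is just direct standardization: reweight the age-specific rates by a fixed reference composition (here, 2014) instead of each year's own composition. A sketch with made-up numbers, not the actual Pew/CPS data:

```python
# Direct age standardization, sketched with hypothetical numbers.
# The raw rate weights each age group by that year's own population;
# the adjusted rate reuses a fixed reference (2014-style) composition.
def raw_rate(rates, pops):
    total = sum(pops.values())
    return sum(rates[a] * pops[a] for a in rates) / total

def standardized_rate(rates, ref_pops):
    total = sum(ref_pops.values())
    return sum(rates[a] * ref_pops[a] for a in rates) / total

# Hypothetical fraction living alone by age group in some earlier year:
rates_1990 = {"65-74": 0.28, "75-84": 0.38, "85+": 0.45}
pops_1990  = {"65-74": 18e6, "75-84": 10e6, "85+": 3e6}
pops_2014  = {"65-74": 25e6, "75-84": 13e6, "85+": 6e6}  # reference

print(raw_rate(rates_1990, pops_1990))           # weighted by 1990 ages
print(standardized_rate(rates_1990, pops_2014))  # weighted by 2014 ages
```

If the reference population is older, the standardized rate moves toward the rates of the older groups; when the two numbers barely differ, as in our graphs, composition shifts aren't driving the trend.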

Jonathan then broke down the data by age and ethnicity (non-Hispanic Black, Hispanic, and non-Hispanic White; I guess there weren’t enough data on Other, going back to 1900):


To see people blogging about it in real time — that’s not the way science really gets done. . . .

Seriously, though, it’s cool to see how much can be learned from very basic techniques of statistical adjustment and graphics.

And maybe we made some mistakes. If so, putting our results out there in clear graphical form should make it easier for others to find problems with what we’ve done. And that’s what it’s all about. We only spent a few hours on this—we didn’t spend a year, sweating over every number, sweating over what we were doing—but we’d still welcome criticism and feedback, from any source. That’s a way to move science forward, to move the research forward.

P.S. I sent this to demographer Philip Cohen who wrote:

There are two issues with 65+: 1 is they are getting older, 2 is they are getting younger since the baby boomers suddenly hit 65.

Here’s a post by Cohen on this from a couple months ago.

The answer is the Edlin factor

Garnett McMillan writes:

You have argued about the pervasive role of the Garden of Forking Paths in published research. Given this influence, do you think that it is sensible to use published research to inform priors in new studies?

My reply: Yes, I think you can use published research but in doing so you should try to account for selection and publication bias. It’s not always clear exactly how to do this. As the saying goes, further research is needed.

They threatened to sue Mike Spagat but that’s not shutting him up

Mike Spagat, famous for blowing the whistle on that Iraq survey (the so-called Lancet study) ten years ago, writes:

I’ve just put up the story about how a survey research company threatened to sue me to keep me quiet. I’ve also put up a lot of data that readers can analyse if they want to get into this really interesting issue.

This time it’s another set of surveys from Iraq that concerns Spagat.

Here’s the background:

The polling companies D3 Systems and KA Research Limited have fielded a number of surveys in Iraq since the invasion.

Some were for internal use only within the US government and must have informed US diplomatic and military policy.

Some got major public exposure, even winning an Emmy Award for ABC news.

Hardly any of the detailed micro data have been released for inspection.

Two datasets done by PIPA of the University of Maryland have been in the public domain and I have them.

Steven Koczela of The MassINC Polling Group obtained four datasets from the Broadcasting Board of Governors through a FOIA.

The State Department has ignored a similar FOIA request (I didn’t know that ignoring a FOIA was an option.)

I [Spagat] have found evidence that many of the data in the six surveys I have are fabricated.

See my [Spagat’s] conference paper for a short summary of this evidence. . . .

I sent the paper to D3 for comment . . .

And here’s what happened next (see page 16 of Spagat’s slides):

Re: D3 Systems, Inc. v Michael Spagat and Steve Koczela:

This firm represents D3 Systems, Inc. (“Our Client” or “D3”). Our client has retained us to commence litigation against you and any entity with which you are affiliated (including .) seeking compensation for, and equitable relief to terminate, your distribution and publication of false and defamatory statements about D3 to its clients and others.

Accordingly, WE HEREBY DEMAND, on and in behalf of our client that you:

1. IMMEDIATELY CEASE AND DESIST further delivery, dissemination, distribution and publication of the Subject Document and any of its content.

2. Deliver to this office within 8 days of the date of this letter a complete and accurate list of all parties to whom you have delivered, or requested publication of, all or any portion of the Subject Document. We note that you have refused to provide this information to our client and we warn you that your continued refusal to do so is not only further evidence of your intention to interfere with our client’s business relations, but also serves to exacerbate the financial damage your actions have inflicted on our client and therefore to increase the amount of compensatory and punitive monetary damages that our client will seek from you.

3. Deliver to this office within 8 days of the date of this letter a list of all individuals and organizations with whom you communicated, or from whom you received information, in connection with the preparation of the subject document.

This letter does not constitute an admission, waiver, agreement or forbearance of any kind. We hereby reserve all of our client’s rights and remedies.

Spagat continues:

What happened next?

1. My College spent a lot of money on lawyers.

2. We decide to write back asking if D3 can point me to a specific problem in the paper. I will gladly correct any errors.

3. D3 and D3’s lawyer do not write back with any specific suggestions.

It was a bluff.

He summarizes what is known:

Recall that I have only been able to analyse six polls – many of the D3/KA Iraq polls remain out of reach, including the State Department ones and a series of polls sponsored by ABC, the BBC and other media organizations.

There is no way to really know how valid these polls are without examining the data; however, they must be viewed as under a cloud as long as the data remain hidden.

It is also possible that many of the messages emanating from these polls are broadly accurate even if many of the data are fabricated. . . . maybe the historical record in Iraq has not been badly distorted by polling fabrication.

But we need to have an open and honest assessment of the work of D3/KA there so that we can make this call.

I don’t don’t don’t don’t don’t trust surveys where the data are hidden. It’s not necessarily data fabrication. People can exaggerate, for example reporting 2 responses out of 6 as 33% without giving the sample size, or they can gather and report data with no concern for representativeness, or they can misleadingly characterize survey responses. Or people can just be confused. Lots of reasons to distrust surveys when the data won’t be shared, lots of reasons to be wary of making policy based on them.

P.S. More here and here, about which Spagat writes:

I mostly stress the way that a central estimate (of excess deaths in war) seems to just float upwards unchecked. But, perhaps more interesting is what seems to be a related phenomenon: the (gargantuan) uncertainty interval seems to somehow just shrink down to nothing as the central estimate slides up. It’s as if I start out trying to sell you a car on the basis that you should get 20-40 miles per gallon and by the end of the conversation I’m saying that 33 mpg is pretty much the worst-case scenario if you drive like a maniac.

Maybe that gas example doesn’t quite capture the situation, since in this example the top of the uncertainty interval is 14 times the bottom. If you want to take seriously something that wide you are really forced to suppress the uncertainty.

On deck this week

Mon: They threatened to sue Mike Spagat but that’s not shutting him up

Tues: “Smaller Share of Women Ages 65 and Older Are Living Alone,” before and after age adjustment

Wed: Objects of the class “Pauline Kael”

Thurs: research-lies-allegations-windpipe-surgery

Fri: Hey—here’s a tip from the biology literature: If your correlation is .02, try binning your data to get a correlation of .8 or .9!

Sat: The NYT inadvertently demonstrates how not to make a graph

Sun: How an academic urban legend can spread because of the difficulty of clear citation

How to design a survey so that Mister P will work well?


Barry Quinn writes:

I would like some quick advice on the survey design literature; specifically, any good references you would have for designing a good online survey that allows for some decent hierarchical modeling?

My quick response is that during the opening you should already be thinking about the endgame. In this case, the endgame is that you want results you believe, and that will stand up to outside criticism. The “middlegame,” then, will involve adjusting the sample to match the population, using Mister P or whatever. And the opening will include the following four steps:

1. Try your best to reach everyone in the target population: minimize selection bias in the process of contact and response.

2. Take care when measuring the outcomes of interest. Let your measurements match your goals. For example, if you are interested in individual changes over time, try to get multiple measurements on the same people.

3. Gather enough background information on respondents so you can do a good job adjusting. So gather demographic information and also other relevant variables (such as party identification and past votes, if this is a political survey).

4. If you’re going to do Mister P, then get good group-level predictors. Hierarchical modeling is most effective when it is used to smooth toward a model that predicts well; it doesn’t work magic when you don’t have good group-level information.
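Once those pieces are in place, the poststratification step at the end is mechanically simple: a population-weighted average of cell-level estimates. A minimal sketch with hypothetical cells (in real Mister P the cell estimates would come from a hierarchical model, e.g., fit in Stan, not plugged-in numbers):

```python
# Minimal poststratification sketch with hypothetical cells.
cells = [
    # (cell estimate from the model, population count for the cell)
    (0.55, 30_000),   # e.g., women, age 18-34
    (0.48, 25_000),   # e.g., men, age 18-34
    (0.61, 20_000),   # e.g., women, age 35+
    (0.52, 25_000),   # e.g., men, age 35+
]

# Weight each cell's estimate by its share of the target population.
population_estimate = (sum(est * n for est, n in cells)
                       / sum(n for _, n in cells))
print(population_estimate)
```

This is why steps 3 and 4 matter: you can only define the cells, and get good estimates within them, if you measured the right background variables and have good group-level predictors.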

In answer to your direct question: No, I don’t know that this is addressed in the literature. Unless you count this post as part of the survey design literature, in which case I guess I could point here.

Log Sum of Exponentials for Robust Sums on the Log Scale

This is a public service announcement in the interest of more robust numerical calculations.

Like matrix inverse, exponentiation is bad news. It’s prone to overflow or underflow. Just try this in R:

> exp(-800)
> exp(800)

That’s not rounding error you see. The first one evaluates to zero (underflows) and the second to infinity (overflows).
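The same behavior is easy to check in Python, which I'll use for the sketches below (one difference: Python's math.exp raises an error on overflow, where R silently returns Inf):

```python
# Underflow and overflow of exp() in Python.
import math

print(math.exp(-800))   # 0.0: silent underflow
try:
    math.exp(800)
except OverflowError:
    print("overflow")   # Python raises instead of returning Inf
```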

A log density of -800 is not unusual with the log likelihood of even a modestly sized data set. So what do we do? Work on the log scale, of course. It turns products into sums, and sums are much less prone to overflow or underflow.

log(a * b) = log(a) + log(b)

But what do we do if we need the log of a sum, not a product? We turn to a mainstay of statistical computing, the log sum of exponentials function.

log(a + b) = log(exp(log(a)) + exp(log(b)))

           = log_sum_exp(log(a), log(b))

We use a little algebraic trick to prevent overflow and underflow while preserving as many accurate leading digits in the result as possible:

log_sum_exp(u, v) = max(u, v) + log(exp(u - max(u, v)) + exp(v - max(u, v)))

The leading digits are preserved by pulling the maximum outside. The arithmetic is robust because subtracting the maximum on the inside makes sure that only negative numbers or zero are ever exponentiated, so there can be no overflow on those calculations. If there is underflow, we know the leading digits have already been returned as part of the max term on the outside.
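Here is the trick written out as code (a plain illustration, not Stan's actual C++ implementation):

```python
# Stable log-sum-of-exponentials: pull out the max before exponentiating.
import math

def log_sum_exp(u, v):
    m = max(u, v)
    # u - m and v - m are <= 0, so exp() cannot overflow here.
    return m + math.log(math.exp(u - m) + math.exp(v - m))

# The naive log(exp(u) + exp(v)) fails for u = v = -800 (both exp()
# calls underflow to 0), but the stable version just works:
print(log_sum_exp(-800, -800))   # -800 + log(2)
```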


We use log-sum-of-exponentials extensively in the internal C++ code for Stan, and it also pops up in user programs when there is a need to marginalize out discrete parameters (as in mixture models or state-space models). For instance, if we have a normal log density function, we can compute the mixture density with mixing proportion lambda as

log_sum_exp(log(lambda) + normal_log(y, mu[1], sigma[1]),
            log1m(lambda) + normal_log(y, mu[2], sigma[2]));
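For concreteness, here's that Stan snippet transcribed into Python for a single observation, with normal_log written out by hand and log1m expressed via log1p. On inputs where both are safe, it agrees with the direct, less robust computation:

```python
# Two-component normal mixture log density via log-sum-exp.
import math

def normal_log(y, mu, sigma):
    # Log density of Normal(mu, sigma) at y.
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_sum_exp(u, v):
    m = max(u, v)
    return m + math.log(math.exp(u - m) + math.exp(v - m))

def mixture_log_density(y, lam, mu, sigma):
    # log1m(lam) written as log1p(-lam).
    return log_sum_exp(math.log(lam) + normal_log(y, mu[0], sigma[0]),
                       math.log1p(-lam) + normal_log(y, mu[1], sigma[1]))

# Check against the direct (less robust) computation on safe inputs:
direct = math.log(0.3 * math.exp(normal_log(1.0, 0.0, 1.0))
                  + 0.7 * math.exp(normal_log(1.0, 5.0, 2.0)))
print(mixture_log_density(1.0, 0.3, [0.0, 5.0], [1.0, 2.0]), direct)
```

The payoff comes when the component log densities are far below -700, where the direct version would underflow to log(0).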

The function log1m is used for robustness; its value is defined algebraically by

log1m(u) = log(1 - u)

But unlike the naive calculation, it won’t incorrectly return 0 when u is close to 0 and 1 - u rounds to 1. Try this in R:

log(1 - 10e-20)
log1p(-10e-20)

log1m isn’t built into R, but log1p is, and negation doesn’t lose us any digits. The subtraction in the first expression rounds to 1, so the log returns 0 and the answer is lost entirely. But the second expression returns the correct non-zero result.
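The same comparison works in Python's math module:

```python
# Naive log(1 - u) vs. log1p(-u) for tiny u.
import math

u = 10e-20
print(math.log(1 - u))    # 0.0: 1 - u rounds to exactly 1
print(math.log1p(-u))     # -1e-19, correct to full precision
```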

What goes on under the hood is that different approximations to the log function are used depending on the value of u, typically lower-order series expansions in the regimes where the standard algorithms are prone to underflow or overflow.

No, I’m not convinced by this one either.


Alex Gamma sends along a recently published article by Carola Salvi, Irene Cristofori, Jordan Grafman, and Mark Beeman, along with the note:

This might be of interest to you, since it’s political science and smells bad.

From The Quarterly Journal of Experimental Psychology: Two groups of 22 college students each, identified as conservatives or liberals based on two 7-point Likert scales, were asked whether they solved a word association task by insight or by analysis. There was a statistically significant group × solving-strategy interaction in the ANOVA (Fig. 2), and the findings are declared to provide “novel evidence that political orientation is associated with problem-solving strategy.” This clearly warrants the paper’s title, “The politics of insight.”

I replied: N=44, huh?

To which Gamma wrote:

About the N=44, the authors point out that they matched students pairwise on their scores in the two Likert tasks, but I’m not sure that makes their conclusions more trustworthy. From the paper:

“Our final sample consisted of 22 conservatives who were matched with 22 liberal participants. For example, each participant who scored 7 on the conservatism scale and 1 on the liberalism scale was matched (on age and ethnicity) with another participant who scored 7 on the liberalism scale and 1 on the conservatism scale. Each participant who scored 7 on the conservatism scale and 2 on the liberalism scale was matched (on age and ethnicity) with another participant who scored 7 on the liberalism scale and 2 on the conservatism scale and so on. The final sample of 44 participants was balanced for political orientation and ethnicity.”

I see no reason to believe these results. That is, in a new study I have no particular expectation that these results would replicate. They could replicate—it’s possible—I just don’t find this evidence to be particularly strong.

From their Results section:

[Three screenshots of the results, not reproduced here.]

As I typically say when considering this sort of study, I think the researchers would be better off looking at all their results rather than sifting based on statistical significance.

Am I being too hard on these people?

But wait! you say. Isn’t almost every study plagued by forking paths? And I say, sure, that’s why I don’t take a p-value as evidence here. If you have some good evidence, fine. If all you have is a p-value and a connection to a vague theory, then no, I see no reason to take it seriously.

And, just to say this one more time, I’m not recommending a preregistered replication here. If they want to, fine, but it would seem to me to be a waste of time.

Stan makes Euro predictions! (now with data and code so you can fit your own, better model)

Leonardo Egidi writes:

Inspired by your World Cup model, I fitted a model in Stan for the Euro Cup, which starts today, with two Poisson distributions for the goals scored in every match by the two teams (perfect prediction for the first match!).

Data and code are here.
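For intuition, here's the basic calculation that the two-Poisson setup implies, independent of Leo's particular priors: once you have goal rates for the two teams, win/draw/loss probabilities follow by summing over scorelines. The rates below are made up, not Leo's fitted values:

```python
# Win/draw/loss probabilities from two independent Poisson goal rates.
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def outcome_probs(rate_a, rate_b, max_goals=15):
    # Sum over all scorelines up to max_goals; the tail beyond is negligible.
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, rate_a) * poisson_pmf(j, rate_b)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss

print(outcome_probs(1.6, 1.1))  # hypothetical favorite vs. weaker team
```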

Here’s the model, and here are the (point) predictions:

[Four screenshots of the point predictions, not reproduced here.]

I didn’t look at the model but at first glance I think the priors are a bit vague. Maybe it doesn’t really matter, though. Anyway, I’m glad to see that France will make it to the semifinals. (I root for France whenever the U.S. and Guatemala aren’t in the picture.)

Maybe Leo can also send along the Stan code and the data?

P.S. Just to be clear: this is not my model. Leo gets all the credit/blame.

P.P.S. Leo sent me the data and code along with the following note:

I would like to re-estimate the model for the second stage of the Euro Cup after the first group-stage results.

In euro2016.R I inserted some comments for explaining my choices. I just need to point out the following:

– The “final” model I used is euro2016Matt.stan. Here I am using some – let me say – informative priors for parameters c and d which regard the attack and the defense score. By soccer experience, I notice that over the years the best teams are likely to perform quite well in the first matches, and that is the reason for my priors with a lower bound of 0.5.

– As pointed out by one of your readers (I already answered on your blog), there is not yet a system in the model for deciding the winner after regular time. Right now, the predictions for the knockout stage are less reliable.

Betancourt Binge (Video Lectures on HMC and Stan)

Michael Betancourt on HMC via YouTube

Even better than binging on Netflix, catch up on Michael Betancourt’s updated video lectures, just days after their live theatrical debut in Tokyo.

His previous videos have received very good reviews and they’re only getting better. The second season brings new material and a healthy presenter (he had a nasty cold the last time around).

Racial classification sociology controversy update

The other day I posted on a controversy in sociology where Aliya Saperstein and Andrew Penner analyzed data from the National Longitudinal Survey of Youth, coming to the conclusion that “race is not a fixed characteristic of individuals but is flexible and continually negotiated in everyday interactions,” but then Lance Hannon and Robert DeFina argued that those claims were mistaken, the product of a data analysis that did not properly account for measurement error. I did not check out the details in either of the two sides’ analysis, but Hannon and DeFina’s criticism made sense to me, in that it did seem to me that measurement error could cause the artifacts of which they wrote.

As I wrote in my post, the story so far seems consistent with the common pattern in which a research team finds something interesting and then, rather than trying to shoot down their own idea, they try to protect it, running robustness studies with the goal of confirming their hypothesis, responding to criticisms defensively, and so forth. Just par for the course, and it’s a good thing that the journal Sociological Science was there to publish Hannon and DeFina’s article.

Anyway, we had some good blog discussion, including this detailed comment by Aaron Gullickson:

Given that I [Gullickson] am a co-author with Saperstein on another paper on racial fluidity (in this case switching between “black” and “mulatto” in the 19th-century US South), I have given some serious thought to what exactly is going on here to produce bias, since it is not particularly well spelled out by Hannon and DeFina. Based on simulations, I believe the potential for bias from reporting error exists in the models that control for prior racial identification (models 1-3 above) but not in the fixed effects model (model 4). Admittedly, the fixed effect estimate is smaller here and if you look at some of S&P’s other work, it tends to be less robust across different dependent variables and model specifications, but I think it is too early to start throwing out the results entirely. It’s also difficult to know what proportion of switchers are actually data errors (although I have seen a draft of S&P’s AJS response and I think they have some clever ways of thinking about it, but I won’t steal their thunder).

Gullickson’s comment also included a replicable simulation in R outlining the source of the bias.
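The mechanism can be sketched in a few lines. Here’s a toy version in Python (my illustration of the general point, not Gullickson’s actual simulation): race is held fixed, both waves of self-identification carry a bit of classification error, and incarceration depends only on true race. Even so, a regression that “controls” for the error-laden wave-1 report recovers a spurious incarceration effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Fixed "true" race (1 = black, 0 = white); no racial fluidity at all here
true_black = rng.binomial(1, 0.3, n)

# Incarceration depends only on true race; it has no causal effect
# on identification in this simulated world
incarcerated = rng.binomial(1, np.where(true_black == 1, 0.20, 0.05))

def noisy_report(true, err=0.05):
    # each report independently misclassifies with probability `err`
    flip = rng.binomial(1, err, true.shape)
    return np.where(flip == 1, 1 - true, true)

race_t1 = noisy_report(true_black)  # self-identification at wave 1
race_t2 = noisy_report(true_black)  # self-identification at wave 2

# Regress wave-2 identification on incarceration, "controlling" for the
# error-laden wave-1 identification
X = np.column_stack([np.ones(n), incarcerated, race_t1])
beta, *_ = np.linalg.lstsq(X, race_t2, rcond=None)
print(round(beta[1], 3))  # clearly positive, despite zero true effect
```

Because race_t1 measures true race imperfectly, it doesn’t fully control for it, and the incarceration coefficient soaks up the leftover association. Set err=0 and the artifact disappears.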

I don’t quite understand the “steal their thunder” thing: Saperstein and Penner could post a preprint of their American Journal of Sociology response right now. There’s no need to wait for formal publication; why not get the response out right away?

Anyway, I sent a message to Hannon asking if he had any response to the above comment by Gullickson. Here’s how Hannon replied:

Gullickson seems to be making an argument that is consistent with our call for greater use of fixed-effects models in this literature. I don’t agree with his suggestion that accounting for measurement error in racial fluidity studies means a total focus on missing completely at random because “Anything else gets into the tricky question of what measurement error really means anyway on a variable with no real true response.” I think non-random measurement error matters too in this context. More generally, I hope that researchers in this area eventually move away from the idea that when an effect cannot be explained by random error, it must be due to their particular hypothesized causal mechanism (racial priming and the internalization of stereotypes). There are ways to more directly test their mechanism.

I’m happy that the discussion of this paper is centering on measurement issues. As I’ve written many times, I think the importance of careful measurement is underrated in statistics and in social science.

Also, Hannon said one thing I like so much I’ll repeat it right here:

More generally, I hope that researchers in this area eventually move away from the idea that when an effect cannot be explained by random error, it must be due to their particular hypothesized causal mechanism.

Yes. Rejection of the straw-man null should not be taken as acceptance of the favored alternative.

A Primer on Bayesian Multilevel Modeling using PyStan

Chris Fonnesbeck contributed our first PyStan case study (I wrote the abstract), in the form of a very nice Jupyter notebook. Daniel Lee and I had the pleasure of seeing him present it live as part of a course we were doing at Vanderbilt last week.

A Primer on Bayesian Multilevel Modeling using PyStan

This case study replicates the analysis of home radon levels using hierarchical models of Lin, Gelman, Price, and Krantz (1999). It illustrates how to generalize linear regressions to hierarchical models with group-level predictors and how to compare predictive inferences and evaluate model fits. Along the way it shows how to get data into Stan using pandas, how to sample using PyStan, and how to visualize the results using Seaborn.
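To give a sense of what’s in the notebook, the core of the radon analysis is a varying-intercept model with partial pooling across counties. Here’s a minimal sketch (my own stripped-down version, not Chris’s code; the toy data stand in for the radon file that the notebook reads with pandas, and the commented-out lines assume PyStan 2’s interface):

```python
import numpy as np

varying_intercept_model = """
data {
  int<lower=0> N;                  // radon measurements
  int<lower=0> J;                  // counties
  int<lower=1, upper=J> county[N];
  vector[N] x;                     // floor of measurement (0 = basement)
  vector[N] y;                     // log radon level
}
parameters {
  vector[J] a;                     // per-county intercepts
  real b;                          // floor effect
  real mu_a;
  real<lower=0> sigma_a;
  real<lower=0> sigma_y;
}
model {
  a ~ normal(mu_a, sigma_a);       // partial pooling across counties
  y ~ normal(a[county] + x * b, sigma_y);
}
"""

# Toy stand-in data in the format the model block expects
rng = np.random.default_rng(1)
J, N = 10, 200
county = rng.integers(1, J + 1, N)
x = rng.integers(0, 2, N).astype(float)
y = rng.normal(1.5 - 0.6 * x, 0.7)
radon_data = {"N": N, "J": J, "county": county, "x": x, "y": y}

# With PyStan 2 installed, compiling and sampling would look like:
#   import pystan
#   sm = pystan.StanModel(model_code=varying_intercept_model)
#   fit = sm.sampling(data=radon_data, iter=1000, chains=4)
```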

As an added bonus, if you follow the link to the source repo on GitHub, you’ll find a Gaussian process case study. I haven’t even had time to look at it yet, but if it’s as nice as this radon study, it’ll be well worth checking out.

P.S. If you’re wondering what one of the core PyMC developers was doing writing PyStan examples, it was because he invited us to teach a course on RStan at Vanderbilt to his biostatistics colleagues who didn’t want to learn Python. It was extremely generous of him to put promoting good science ahead of promoting his own software! Part of our class was on teaching Bayesian methods and how to code models in Stan, and Chris offered to do some case studies, which is what Andrew usually does when he’s the third instructor. Chris said he tried RStan, but then bailed and went back to Python where he could use familiar and powerful Python tools like pandas and numpy and seaborn. It’s hard to motivate learning a whole new language and toolchain just to write one example. The benefit to us is that we now have a great PyStan example. Thanks, Chris!

“What is a good, convincing example in which p-values are useful?”

A correspondent writes:

I came across this discussion of p-values, and I’d be very interested in your thoughts on it, especially on the evaluation in that thread of “two major arguments against the usefulness of the p-value:”

1. With large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis.

2. Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre.

P.S. If you happen to share this, I’d prefer not to be identified. Thanks.

This seems like a pretty innocuous question so it’s not clear why he wants to keep his identity secret, but whatever.

I followed the link above and it was titled, “What is a good, convincing example in which p-values are useful?”

I happen to have written on the topic! See page 70 of this article, the section entitled “Good, Mediocre, and Bad P-values.”

I also blogged on this article last year; see here and here.

Donald Trump and Joe McCarthy

He built . . . a coalition of the aggrieved—of men and women not deranged but affronted by various tendencies over the previous two or three decades . . .

That’s political reporter Richard Rovere in his 1958 classic, “Senator Joe McCarthy.” I hate to draw an analogy between McCarthy and Donald Trump because it seems so obvious . . . but I happened to be reading Rovere’s book and came across so many passages that reminded me of Trump, I had to share.

Here are a few:

He was a fertile innovator, a first-rate organizer and galvanizer of mobs, a skilled manipulator of public opinion, and something like a genius at that essential American strategy: publicity.

Intimations, allegations, accusations of treason were the meat upon which this Caesar fed. He could never swear off.

The Gallup Poll once tested his strength in various occupational groups and found that he had more admirers among manual workers than in any other category—and fewest among business and professional people.

Because McCarthyism had no real grit and substance as a doctrine and no organization, it is difficult to deal with as a movement. Adherence was of many different sorts. There were those who accepted McCarthy’s leadership and would have been happy to see him President. There were others who were indifferent to his person but receptive to what he had to say about government. There were others still who put no particular stock in what he had to say and even believed it largely nonsense but felt that he was valuable anyway.

McCarthy drew into his following most of the zanies and zombies and compulsive haters who had followed earlier and lesser demagogues in the fascist and semifascist movements of the thirties and forties. . . . But this was really the least part of it. McCarthy went far beyond the world of the daft and the frenzied—or, to put the matter another way, that world was greatly enlarged while he was about.

In his following, there were many people who counted for quite a bit in American life—some because of wealth and power, some because of intelligence and political sophistication. He was an immediate hit among the Texas oilmen, many of whom were figures as bizarre and adventurous in the world of commerce and finance as he was in the world of politics. . . . And there were intellectuals and intellectuals manque whose notions of Realpolitik had room for just such a man of action as McCarthy.

L’etat, c’est moi, legibus solutus, and I Am the Law. He and the country were one and the same, synonymous and interchangeable.

We see echoes of this, not merely in Trump’s own statements, but from his supporters. For example, this from Scott Adams (sorry!): “Trump supporters don’t have any bad feelings about patriotic Americans such as myself,” which works pretty well if by “patriotic Americans” you exclude various patriotic Americans such as Barack Obama, Hillary Clinton, Gonzalo Curiel, and various news reporters.

Back to Rovere:

It was a striking feature of McCarthy’s victories, of the surrenders he collected, that they were mostly won in battles over matters of an almost comic insignificance. His causes celebres were causes ridicules. . . .

Yet the antic features of McCarthyism were essential ones. For McCarthyism was, among other things, but perhaps foremost among them, a headlong flight from reality. It elevated the ridiculous and ridiculed the important. It outraged common sense and held common sense to be outrageous. It confused the categories of form and value. It made sages of screwballs and accused wise men of being fools. It diverted attention from the moment and fixed it on the past, which it distorted almost beyond recognition.

On the ravages of demagogy and its flight from reality, Thucydides wrote:

The meaning of words had no longer the same relation to things but was changed by them as they thought proper. Reckless daring was held to be courage, prudent delay was the excuse of a coward; moderation was the disguise of unmanly weakness; to know everything was to do nothing. Frantic energy was the true quality of a man. . . . He who succeeded in a plot was deemed knowing, but a still greater master in craft was he who detected one.

McCarthy, then, was of the classic breed. For all the black arts that he practiced, his natural endowments and his cultivated skills were of the very highest order. His tongue was loose and always wagging; he would say anything that came into his head and worry later, if at all, about defending what he had said.

And this:

There has never been the slightest reason to suppose that he took what he said seriously or that he believed any of the nonsense he spread.

He was a vulgarian by method as well as, probably, by instinct. . . . If he did not dissemble much, if he did little to hide from the world the sort of human being he was, it was because he had the shrewdness to see that this was not in his case necessary. . . . In general, the thing he valued was his reputation for toughness, ruthlessness, even brutality. . . . And this sort of thing was always well received by his followers.

While other politicians would seek to conceal a weakness for liquor or wenching or gambling, McCarthy tended to exploit, even to exaggerate, these wayward tastes. He was glad to have everyone believe he was a drinker of heroic attainments, a passionate lover of horseflesh, a Clausewitz of the poker table, and a man to whom everything presentable in skirts was catnip. (When a good-looking woman appeared as committee witness, McCarthy, leering, would instruct counsel “to get her telephone number for me” as well as the address for the record.)

And we’re still only on page 52.

The characteristics that Trump particularly seems to share with McCarthy are boastfulness and self-focus; willingness to boldly lie about important things and, perhaps more important, escalate rather than backing down after the lie is caught; a willingness to attack respected figures; and a fundamental frivolousness, a sense that they are not taking all this very seriously.

There are differences, the biggest being, I think, that McCarthy had a lot of political power while Trump has none. One of the most chilling things described in Rovere’s book is how much influence the senator from Wisconsin had in government operations and foreign policy.

Here’s Rovere: “One of his most striking instruments was a secret seditionist cabal he had organized within the government. This was a network of government servants and members of the armed forces (“the Loyal American Underground,” some of the proud, defiant members called themselves) who, in disregard for their oaths of office and the terms of their contracts with the taxpayers, reported directly to McCarthy and gave him their first loyalty.”

I don’t think Trump has that. He’s got a lot of internet blog commenters and maybe the support of Vladimir Putin and the Russian secret service, but I haven’t heard of a network of supporters within the U.S. government.

Another difference is that McCarthy’s thing was communism, whereas Trump’s thing is racism and sexism.

And McCarthy was more popular than Trump. Here’s Rovere:

In January 1954, when the record was pretty well all in and the worst as well as the best was known, the researches of the Gallup Poll indicated that 50 per cent of the American people had a generally “favorable opinion” of him and felt that he was serving the country in useful ways. Twenty-one per cent drew a blank—“no opinion.” The conscious, though not necessarily active, opposition—those with an “unfavorable opinion”—was 29 per cent. A “favorable opinion” did not make a man a McCarthyite, and millions were shortly to revise their view to his disadvantage. But an opposition of only 29 per cent is not much to count on, and it was small wonder that his contemporaries feared him.

In contrast, Trump is viewed favorably by 34% of survey respondents and unfavorably by 58%—I just looked it up. This does not guarantee a general election loss—Hillary Clinton’s favorable/unfavorable numbers are 37% and 56%, which is not much better—but it is a contrast with the earlier demagogue.

In retrospect, I suppose McCarthy had to have been that popular, in that his national following was the source of his power, and, without it, his fellow senators would not have supported him for so long.

By pointing out these striking parallels (and some differences) between McCarthy and Trump, I do not mean to imply that Trump is the only modern politician to share certain of McCarthy’s attitudes and behaviors.

P.S. I googled *Trump McCarthy* to see what else was out there. I came across this from James Downie who emphasizes that Trump, like McCarthy, will just make up numbers and use them to get headlines.

Also this ridiculous (in retrospect) article, “The New McCarthyism of Donald Trump,” from Peter Beinart, which begins:

Pundits are pretty sure that Donald Trump has “jumped the shark.” “Mr. Trump’s candidacy probably reached an inflection point on Saturday after he essentially criticized John McCain for being captured during the Vietnam War,” declared The New York Times’ Nate Cohn last weekend. “Republican campaigns and elites quickly moved to condemn his comments—a shift that will probably mark the moment when Trump’s candidacy went from boom to bust.”

If Cohn is right, and I certainly hope he is, Trump’s political career will have followed the same basic arc as that of another notorious American demagogue, Joseph McCarthy. . . .

It was only when McCarthy targeted the United States military that Republicans began taking him on. In late 1953, when McCarthy began investigating alleged communist influence in the Army, the Army counterattacked. . . .

Although it’s too early to declare Trump’s political career over, the last few days resemble McCarthy’s descent in 1953 and 1954. Even before last weekend, Republican elites increasingly viewed him as a political liability. Then, on Saturday, Trump ventured beyond his previous “soft” targets—immigrants, blacks, and President Obama—and claimed John McCain was not really a war hero. Trump’s GOP opponents, who until then had mostly tried to ignore him, pounced.

When Beinart wrote, “the last few days,” that was on July 21, 2015!

What’s so wrong with the above passage is not that Nate Cohn and Peter Beinart made a prediction that happened not to occur, or even that they took Bill Kristol as representative of Republican opinion. What bugs me is that Beinart botched the history. His story is that McCarthy was riding high, then he targeted the military, then he was brought down, but that’s not quite right. Beinart locates McCarthy’s targeting of the military in 1953, but it was two years earlier, in 1951, that McCarthy attacked George Marshall. McCarthy calling Marshall a traitor was a much bigger deal than Trump saying that McCain was not really a war hero—and, sure, lots of people were stunned that McCarthy took that step—but he ascended to his greatest power after the attack on Marshall, and it was years before McCarthy lost power.

Again, I have no crystal ball. As of July 21, 2015, it was perhaps reasonable to think that, by dissing John McCain, Trump had gone too far and that he was doomed. Who’s to say. But it was a misreading of history to think that the analogous action had sent McCarthy down. McCarthy stayed afloat for years after making widely publicized and ridiculous attacks on a prominent military figure.

Social problems with a paper in Social Problems

Here’s the story. In 2010, sociologists Aliya Saperstein and Andrew Penner published in the journal Social Problems a paper, “The race of a criminal record: How incarceration colors racial perceptions,” reporting:

This study extends the conversation by exploring whether being incarcerated affects how individuals perceive their own race as well as how they are perceived by others, using unique longitudinal data from the 1979 National Longitudinal Survey of Youth. Results show that respondents who have been incarcerated are more likely to identify and be seen as black, and less likely to identify and be seen as white, regardless of how they were perceived or identified previously. This suggests that race is not a fixed characteristic of individuals but is flexible and continually negotiated in everyday interactions.

Here’s their key table:

[Screenshots of the regression tables from Saperstein and Penner (2010), Panels A and B]

And here’s how they summarize their results:

Our first models estimate the likelihood of identifying as either white (versus all else) or black (versus all else) in 2002, controlling for incarceration history and self-identification in 1979. In both panels, we then introduce our control variables, one set at a time, to examine whether or not they account for the main effect of incarceration on racial self-identification. First, we add the interviewer controls as a group, followed by both the interviewer and respondent controls and, finally, respondent fixed effects. Coefficients represent the likelihood of identifying as either white or black, depending on the model. Positive coefficients mean the respondent was more likely to identify as the given race; negative coefficients mean the respondent was less likely to do so.
Table 3 shows that respondents who were incarcerated between 1979 and 2002 are less likely to subsequently identify as white (Panel A, Model 1). This effect holds when interviewer characteristics are included (Panel A, Model 2), and though it is slightly reduced by the inclusion of respondent characteristics, such as income and marital status (Model 3), and respondent fixed effects (Model 4), it remains relatively large and statistically significant (-.043) in the final model.

As the years went on, this work got some publicity (for example, here’s something from NPR) but when I first heard about this finding I was worried about measurement error. In particular, the above regression can only give results because there are people who change their racial classification, and there’s only some subset of people who will do this. (For example, I’ll never be counted as “black” no matter how many times I go to jail.) A key predictor variable in the regression model is self-identified race at time 1 of the survey. But this variable is itself measured with error (or, we might say, variation). This error could well be correlated with incarceration, and this raises the possibility that the above results are all statistical artifacts of error (or variation) in a predictor.

I made a mental note of this but didn’t do anything about it. Then a few years later I was talking with someone who told me about a recent research project reanalyzing these data more carefully.

The researchers on this new paper, Lance Hannon and Robert DeFina, conclude that the much-publicized Saperstein and Penner findings were erroneous:

We [Hannon and DeFina] replicate and reexamine Saperstein and Penner’s prominent 2010 study which asks whether incarceration changes the probability that an individual will be seen as black or white (regardless of the individual’s phenotype). Our reexamination shows that only a small part of their empirical analysis is suitable for addressing this question (the fixed-effects estimates), and that these results are extremely fragile. Using data from the National Longitudinal Survey of Youth, we find that being interviewed in jail/prison does not increase the survey respondent’s likelihood of being classified as black, and avoiding incarceration during the survey period does not increase a person’s chances of being seen as white. We conclude that the empirical component of Saperstein and Penner’s work needs to be reconsidered and new methods for testing their thesis should be investigated. The data are provided for other researchers to explore.

This new paper appeared in Sociological Science, a new journal that is much more accepting of critical give and take, compared to traditional sociology journals (about which, see here, for example).

At one point, Hannon and DeFina write:

(1) interviewers will arbitrarily switch between white and other when forced to fit certain types of respondents into a Black-White-Other coding scheme (Smith 1997) and (2) people that are unambiguously white are less likely to be subjected to incarceration.

This is similar to my point above—with the big difference, of course, that Hannon and DeFina actually look at the data rather than just speculating.

A defensive response by an author of the original paper?

Sociological Science has a comment section, and Aliya Saperstein, one of the authors of that original paper, replied there. But the reply didn’t impress me, as she pretty much just repeated her findings without addressing the bias problem. Saperstein concluded her reply with the following:

That said, we would be remiss if we did not acknowledge H&D on one point: case 1738 should have been labeled 1728. We regret the error and any confusion it may have caused.

But I think that bit of sarcasm was a tactical error: in their reply Hannon and DeFina write:

Saperstein and Penner admit to one small error. They note that case 1738 is actually 1728. Unfortunately, case 1728 does not match up to their table either. Before noting that we could not find the classification pattern (in our footnote 5), we searched the data for any cases where 7 of 9 pre-incarceration classifications were white. None exist. This case was not simply mislabeled; the date of incarceration is also off by one period. Given the other abnormalities that we uncovered (see, for example, our footnote 11), we encourage Saperstein and Penner to publicly provide the data and code used to produce their tables.

At that point, Saperstein does admit to a coding error and she suggests that the data and code will be available at some point:

We are currently in the process of assembling a full replication package (designed to take people from the publicly available NLSY data all the way through to our AJS tables), and anticipate posting this as part of the website that will accompany my book (the manuscript for which is currently in progress).

At this point I’m tempted to remind everyone that the original paper came out in 2010 so it’s not clear why in 2016 we should still be waiting some indefinite time for the data and code. But then I remember that I’ve published hundreds of papers in the past few decades and in most cases have not even started to make public repositories of data and code. I send people data and code when they ask, but sometimes it’s all such a mess that I recommend that people just re-start the analysis on their own.

So I’m in no position to criticize Saperstein and Penner for data errors. I will suggest, however, that instead of fighting so tenaciously, they instead thank Hannon and DeFina for noticing errors in their work. Especially if Saperstein is writing a book on this material, she’d want to get things right, no? TED talks and NPR interviews are fine but ultimately we’re trying to learn about the world.

What went wrong?

So, what happened? Here’s my guess. It’s tricky to fit regression models when the predictors are subject to error. Saperstein and Penner did an analysis that seemed reasonable at the time and they got exciting results, the kind of results we love to see in social science: Surprising, at first counterintuitive, but ultimately making sense in fitting into a deeper view of the world. At this point, they had no real desire to look too hard at their analyses. They were motivated to do robustness studies, slice the data in other ways, whatever, but all with the goal of solidifying and confirming their big finding. Then, later, when a couple of sociologists from Villanova University come by with some questions, their inclination is to brush them off. It feels like someone’s trying to take away a hard-earned win based on a technicality. Missing data, misclassifications, an error in case 1738, etc.: who cares? Once a research team has a big success, it’s hard for them to consider the possibility that it may be a castle built on sand. We’ve seen this before.

What went right?

Sociological Science published Hannon and DeFina’s letter, and Saperstein felt it necessary or appropriate to respond. That’s good. The criticism was registered. Even if Saperstein’s response wasn’t all we might have hoped for, at least there’s open discussion. Next step is to read Saperstein and Penner’s forthcoming response in the American Journal of Sociology. I’m not expecting much, but who knows.

But . . .

The original paper had errors, but these were only revealed after Hannon and DeFina did a major research project of their own. This is how science is supposed to work—but if these two researchers hadn’t gone to all this trouble, the original paper would’ve stood. There must have been many people who had reservations similar to mine about the statistical analysis, but these reservations would not have been part of the scholarly literature (or even the online literature, as I don’t think I ever blogged about it).

Eternal vigilance is the price of liberty.

On deck this week

Social problems with a paper in Social Problems

Donald Trump and Joe McCarthy

“What is a good, convincing example in which p-values are useful?”

“How One Study Produced a Bunch of Untrue Headlines About Tattoos Strengthening Your Immune System”

No, I’m not convinced by this one either.

How to design a survey so that Mister P will work well?