
Wanna know what happened in 2016? We got a ton of graphs for you.

The paper’s called Voting patterns in 2016: Exploration using multilevel regression and poststratification (MRP) on pre-election polls, it’s by Rob Trangucci, Imad Ali, Doug Rivers, and myself, and here’s the abstract:

We analyzed 2012 and 2016 YouGov pre-election polls in order to understand how different population groups voted in the 2012 and 2016 elections. We broke the data down by demographics and state and found:
• The gender gap was an increasing function of age in 2016.
• In 2016 most states exhibited a U-shaped gender gap curve with respect to education, indicating a larger gender gap at lower and higher levels of education.
• Older white voters with less education more strongly supported Donald Trump versus younger white voters with more education.
• Women more strongly supported Hillary Clinton than men, with young and more educated women most strongly supporting Hillary Clinton.
• Older men with less education more strongly supported Donald Trump.
• Black voters overwhelmingly supported Hillary Clinton.
• The gap between college-educated voters and non-college-educated voters was about 10 percentage points in favor of Hillary Clinton.
We display our findings with a series of graphs and maps. The R code associated with this project is available at

There’s a lot here. I mean, a lot. 44 displays, from A to Z, and all sorts of things in between.

The New England Journal of Medicine wants you to “identify a novel clinical finding”

Mark Tuttle writes:

This is worth a mention in the blog.

At least they are trying to (implicitly) reinforce re-analysis and re-use of data.

Apparently, some of the re-use efforts will be published, soon.

My reply: I don’t know enough about medical research to make any useful comments here. But there’s one bit that raises my skepticism: the goal is to “use the data underlying a recent NEJM article to identify a novel clinical finding that advances medical science.”

I’m down on the whole idea that the role of statistics and empirical work is to identify novel findings. Maybe we have too much novelty and not enough reproducibility.

I’m not saying that I think the whole project is a bad idea, just that this aspect of it concerns me.

P.S. A lot more in comments from Dale Lehman, who writes:

This is a challenge I [Lehman] entered and am still mad about. Here are some pertinent details:

1. The NEJM editors had published an anti-sharing editorial which attracted much criticism. They felt pressured to do something that either appeared pro-sharing or actually might move data sharing (from clinical trials) forward. So, they started this Challenge.

2. There were a number of awkward impediments to participating – including the need to get IRB approval (even though the data was anonymized and had already been used in publications) and to have an officer at your institution/organization who had financial authority sign off (for what?).

3. 279 teams entered, 143 completed (there was a qualifying round and then a challenge round – ostensibly to make sure that entrants into the latter knew what they were doing enough to be allowed to participate), and 3 winners were selected.

4. I entered but did not win. My own “discovery” was that the results of the more aggressive blood pressure treatment depended greatly on whether or not participants in the trial had missed any of their scheduled visits – particularly if they missed one of the first 3 monthly visits that were in the protocol.

5. Since it appeared to me that compliance with the protocol was important, I was particularly interested in data about noncompliance. I asked about data on “adherence to antihypertensive medications,” which the protocol said was collected in the trial. I was told that the original publication did not use that data, so I could not have it (so much for “novel” findings).

6. To make matters worse, I subsequently discovered that a different article had been published in a different journal (by some of the same authors) using the very adherence scale data I had asked for.

7. To make matters even worse, I sent a note to the editors complaining about this, and saying that either the authors misled the NEJM or the journal was complicit in this. I got no response.

8. The final winners did some nice work, but 2 of the 3 winners created decision tools (one was an app) providing a rating for a prospective patient as to whether or not more aggressive blood pressure treatment was recommended. I did not (and do not) think this is such a novel finding and it disturbs me that these entries focused on discrete (binary) choices – the uncertainty about the estimated effects disappeared. On the contrary, I submitted a way to view the confidence intervals (yes, sorry I still live in that world) for the primary effects and adverse events simultaneously.

So, yes, I am upset by the experience, as were a number of other participants. The conference they held afterwards was also quite interesting – the panel of trial patients were universal in supporting open data sharing and were shocked that researchers were not enthralled by the idea. Of course, I am a sore loser, and perhaps that is what all the other disgruntled losers feel. But it is hard to escape the bad taste the whole thing left in my mouth.

When all the dust settles, it may still prove to be a small step forward towards more open sharing of clinical trial data and the difficulties may be due to the hard work of changing established and entrenched ways of doing things. But at this point in time, I don’t feel supportive of such a conclusion.

What are the odds of Trump’s winning in 2020?

Kevin Lewis asks:

What are the odds of Trump’s winning in 2020, given that the last three presidents were comfortably re-elected despite one being a serial adulterer, one losing the popular vote, and one bringing race to the forefront?

My reply:

Serial adulterer, poor vote in previous election, ethnicity . . . I don’t think these are so important. It does seem that parties do better when running for a second term (i.e., reelection) than when running for a third term (i.e., a new candidate), but given our sparse data, it’s hard to distinguish these three stories:
1. Incumbency advantage: some percentage of voters support the president.
2. Latent variable: given that a candidate wins once, that’s evidence that he’s a strong candidate, hence it’s likely he’ll win again.
3. Pendulum or exhaustion: after a while, voters want a change.

My guess is that the chances in 2020 of the Republican candidate (be it Trump or someone else) will depend a lot on how the economy is growing at the time. This is all within the context of the approximately 50/50 national division associated with political polarization. If the Republican party abandons Trump, that could hurt him a lot. But the party stuck with Trump in 2016, so they very well might in 2020 as well.

I guess I should blog this. Not because I’m telling you anything interesting but because it can provide readers a clue as to how little I really know.

Also, by the time the post appears in March, who knows what will be happening.

What is not but could be if

And if I can remain there I will say – Baby Dee

Obviously this is a blog that loves the tabloids. But as we all know, the best stories are the ones that confirm your own prior beliefs (because those must be true). So I’m focusing on this article in Science that talks about how STEM undergraduate programmes in the US lose gay and bisexual students. This leaky pipeline narrative (that diversity is smaller the further you go in a field because minorities drop out earlier) is pretty common when you talk about diversity in STEM. But this article says that there are now numbers! So let’s have a look…

And when you’re up there in the cold, hopin’ that your knot will hold and swingin’ in the snow…

From the article:

The new study looked at a 2015 survey of 4162 college seniors at 78 U.S. institutions, roughly 8% of whom identified as LGBQ (the study focused on sexual identity and did not consider transgender status). All of the students had declared an intention to major in STEM 4 years earlier. Overall, 71% of heterosexual students and 64% of LGBQ students stayed in STEM. But looking at men and women separately uncovered more complexity. After controlling for things like high school grades and participation in undergraduate research, the study revealed that heterosexual men were 17% more likely to stay in STEM than their LGBQ male counterparts. The reverse was true for women: LGBQ women were 18% more likely than heterosexual women to stay in STEM.

Ok. There’s a lot going on here. First things first, let’s say a big hello to Simpson’s paradox! Although LGBQ people have a lower attainment rate in STEM, it’s driven by men going down and women going up. I think the thing that we can read straight off this is that there are “base rate” problems happening all over the place. (Note that the effect is similar across the two groups and in opposite directions, yet the combined total is fairly strongly aligned with the male effect.) We are also talking about a drop out of around 120 of the 333 LGBQ students in the survey. So the estimate will be noisy.
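
To make the base-rate point concrete, here’s a toy calculation with made-up group sizes and retention rates (not the study’s actual counts), chosen only to roughly mimic the reported aggregates:

d <- data.frame(
  gender = c("men", "men", "women", "women"),
  orient = c("het", "LGBQ", "het", "LGBQ"),
  n      = c(1800,  220,    2000,   110),
  stay   = c(0.72,  0.60,   0.70,   0.82)
)
# Pooled retention by orientation, ignoring gender:
with(d, tapply(n * stay, orient, sum) / tapply(n, orient, sum))
# About 0.71 (het) vs 0.67 (LGBQ): the pooled rate is lower for LGBQ students
# even though, in these made-up numbers, LGBQ women do better than heterosexual
# women. The aggregate mostly tracks the larger male group.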

I’m less worried about forking paths–I don’t think it’s unreasonable to expect the experience to differ across gender. Why? Well, there is a well-known problem with gender diversity in STEM. Given that gay women are potentially affected by two different leaky pipelines, it sort of makes sense that the interaction between gender and LGBQ status would be important.

The actual article does better–it’s all done with multilevel logistic regression, which seems like an appropriate tool. There are p-values everywhere, but that’s just life. I struggled to work out from the paper exactly what the model was (sometimes my eyes just glaze over…), but it seems to have been done fairly well.

As with anything however (see also Gayface), the study is only as generalizable as the data set. The survey seems fairly large, but I’d worry about non-response. And, if I’m honest with you, me at 18 would’ve filled out that survey as straight, so there are also some problems there.

My father’s affection for his crowbar collection was Freudian to say the least

So a very shallow read of the paper makes it seem like the stats are good enough. But what if they’re not? Does that really matter?

This is one of those effects that’s anecdotally expected to be true. But more importantly, a lot of the proposed fixes are the types of low-cost interventions that don’t really need to work very well to be “value for money”.

For instance, it’s suggested that STEM departments work to make LGBT+ visibility more prominent (have visible, active inclusion policies). They suggest that people teaching pay attention to diversity in their teaching material.

The common suggestion for the last point is to pay special attention to work by women and under-represented groups in your teaching. This is never a bad thing, but if you’re teaching something very old (like the central limit theorem or differentiation), there’s only so much you can do. The thing that we all have a lot more control over is our examples and exercises. It is a no-cost activity to replace, for example, “Bob and Alice” with “Barbra and Alice” or “Bob and Alex”.

This type of low-impact diversity work signals to students that they are in a welcoming environment. Sometimes this is enough.

A similar example (but further up the pipeline) is that when you’re interviewing PhD students, postdocs, researchers, or faculty, don’t ask the men if they have a wife. Swapping to a gender neutral catch-all (partner) is super-easy. Moreover, it doesn’t force a person who is not in an opposite gender relationship to throw themselves a little pride parade (or, worse, to let the assumption fly because they’re uncertain if the mini-pride parade is a good idea in this context). Partner is a gender-neutral term. They is a gender-neutral pronoun. They’re not hard to use.

These environmental changes are important. In the end, if you value science you need to value diversity. Losing women, racial and ethnic minorities, LGBT+ people, disabled people, and other minorities really means that you are making your talent pool more shallow. A deeper pool leads to better science and creating a welcoming, positive environment is a serious step towards deepening the pool.

In defence of half-arsed activism

Making a welcoming environment doesn’t fix STEM’s diversity problem. There is a lot more work to be done. Moreover, the ideas in the paragraph above may do very little to improve the problem. They are also fairly quiet solutions–no one knows you’re doing these things on purpose. That is, they are half-arsed activism.

The thing is, as much as it’s lovely to have someone loudly on my side when I need it, I mostly just want to feel welcome where I am. So this type of work is actually really important. No one will ever give you a medal, but that doesn’t make it less appreciated.

The other thing to remember is that sometimes half-arsed activism is all that’s left to you. If you’re a student, or a TA, or a colleague, you can’t singlehandedly change your work environment. More than that, if a well-intentioned-but-loud intervention isn’t carefully thought through it may well make things worse. (For example, a proposal at a previous workplace to ensure that all female students (about 400 of them) have a female faculty mentor (about 7 of them) would’ve put a completely infeasible burden on the female faculty members.)

So don’t discount low-key, low-cost, potentially high-value interventions. They may not make things perfect, but they can make things better and maybe even “good enough”.

What We Talk About When We Talk About Bias

Shira Mitchell wrote:

I gave a talk today at Mathematica about NHST in low power settings (Type M/S errors). It was fun and the discussion was great.

One thing that came up is bias from doing some kind of regularization/shrinkage/partial-pooling versus selection bias (confounding, nonrandom samples, etc). One difference (I think?) is that the first kind of bias decreases with sample size, but the latter won’t. Though I’m not sure how comforting that is in small-sample settings. I’ve read this post which emphasizes that unbiased estimates don’t actually exist, but I’m not sure how relevant this is.

I replied that the error is to think that an “unbiased” estimate is a good thing. See p.94 of BDA.

And then Shira shot back:

I think what is confusing to folks is when you use unbiasedness as a principle here, for example here.

Ahhhh, good point! I was being sloppy. One difficulty is that in classical statistics, there are two similar-sounding but different concepts, unbiased estimation and unbiased prediction. For Bayesian inference we talk about calibration, which is yet another way that an estimate can be correct on average.

The point of my above-linked BDA excerpt is that, in some settings, unbiased estimation is not just a nice idea that can’t be done in practice or can be improved in some ways; rather it’s an actively bad idea that leads to terrible estimates. The key is that classical unbiased estimation requires E(theta.hat|theta) = theta for any theta, and, given that some outlying regions of theta are highly unlikely, the unbiased estimate has to be a contortionist in order to get things right for those values.
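
Here’s a quick simulation sketch of that point, with made-up numbers in which the reasonable values of theta are tightly concentrated and the data are noisy:

# Made-up example: theta concentrated around zero, measurements much noisier.
set.seed(1)
theta  <- rnorm(1e5, 0, 1)       # "reasonable" parameter values
y      <- rnorm(1e5, theta, 5)   # classically unbiased estimates: E(y | theta) = theta
shrunk <- y / (1 + 5^2)          # posterior mean under the corresponding normal model
c(rmse_unbiased = sqrt(mean((y - theta)^2)),       # about 5
  rmse_shrunk   = sqrt(mean((shrunk - theta)^2)))  # about 1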

But in certain settings the idea of unbiasedness is relevant, as in the linked post above where we discuss the problems of selection bias. And, indeed, type M and type S errors are defined with respect to the true parameter values. The key difference is that we’re estimating these errors—these biases—conditional on reasonable values of the underlying parameters. We’re not interested in these biases conditional on unreasonable values of theta.

Subtle point, worth thinking about carefully. Bias is important, but only conditional on reasonable values of theta.

P.S. Thanks to Jaime Ashander for the above picture.

Bob’s talk at Berkeley, Thursday 22 March, 3 pm

It’s at the Institute for Data Science at Berkeley.

And here’s the abstract:

I’ll provide an end-to-end example of using R and Stan to carry out full Bayesian inference for a simple set of repeated binary trial data: Efron and Morris’s classic baseball batting data, with multiple players observed for many at bats; clinical trial, educational testing, and manufacturing quality control problems have the same flavor.

We will consider three models that provide complete pooling (every player is the same), no pooling (every player is independent), and partial pooling (every player is to some degree like every other player). Hierarchical models allow the degree of similarity to be jointly modeled with individual effects, tightening estimates and sharpening predictions compared to the no pooling and complete pooling models. They also outperform empirical Bayes and max marginal likelihood predictively, both of which rely on point estimates of hierarchical parameters (aka “mixed effects”). I’ll show how to fit observed data to make predictions for future observations, estimate event probabilities, and carry out (multiple) comparisons such as ranking. I’ll explain how hierarchical modeling mitigates the multiple comparison problem by partial pooling (and I’ll tie it into rookie of the year effects and sophomore slumps). Along the way, I will show how to evaluate models predictively, preferring those that are well calibrated and make sharp predictions. I’ll also show how to evaluate model fit to data with posterior predictive checks and Bayesian p-values.
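
For anyone who wants to play along in advance, here’s a minimal sketch of the partial-pooling model in R (this is not the code from the talk; it uses rstanarm for brevity and assumes the Efron and Morris data sit in a data frame d with columns player, hits, and at_bats):

# Partial pooling of player abilities on the log-odds scale, fit by full Bayes.
library(rstanarm)
fit <- stan_glmer(cbind(hits, at_bats - hits) ~ (1 | player),
                  family = binomial(), data = d)
print(fit)                    # population mean and between-player sd
pp <- posterior_predict(fit)  # simulated future hits, for checks and predictions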

Gaydar and the fallacy of objective measurement

Greggor Mattson, Dan Simpson, and I wrote this paper, which begins:

Recent media coverage of studies about “gaydar,” the supposed ability to detect another’s sexual orientation through visual cues, reveal problems in which the ideals of scientific precision strip the context from intrinsically social phenomena. This fallacy of objective measurement, as we term it, leads to nonsensical claims based on the predictive accuracy of statistical significance. We interrogate these gaydar studies’ assumption that there is some sort of pure biological measure of perception of sexual orientation. Instead, we argue that the concept of gaydar inherently exists within a social context and that this should be recognized when studying it. We use this case as an example of a more general concern about illusory precision in the measurement of social phenomena, and suggest statistical strategies to address common problems.

There’s a funny backstory to this one.

I was going through my files a few months ago and came across an unpublished paper of mine from 2012, “The fallacy of objective measurement: The case of gaydar,” which I didn’t even remember ever writing! A completed article, never submitted anywhere, just sitting in my files.

How can that happen? I must be getting old.

Anyway, I liked the paper—it addresses some issues of measurement that we’ve been talking about a lot lately. In particular, “the fallacy of objective measurement”: researchers took a rich real-world phenomenon and abstracted it so much that they removed its most interesting content. “Gaydar” existed within a social context—a world in which gays were an invisible minority, hiding in plain sight and seeking to be inconspicuous to the general population while communicating with others of their subgroup. How can it make sense to boil this down to the shapes of faces?

Stripping a phenomenon of its social context, normalizing a base rate to 50%, and seeking an on-off decision: all of these can give the feel of scientific objectivity—but the very steps taken to ensure objectivity can remove social context and relevance.
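
To see the base-rate issue with purely hypothetical numbers (not taken from any particular study): a classifier that looks impressive on a 50/50 sample can still be wrong most of the time at a realistic base rate.

# Assumed sensitivity and specificity of 0.81 each, applied at a 5% base rate.
sens <- 0.81
spec <- 0.81
base <- 0.05
sens * base / (sens * base + (1 - spec) * (1 - base))
# About 0.18: most positive classifications would be false positives.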

We had some gaydar discussion (also here) on the blog recently and this motivated me to freshen up the gaydar paper, with the collaboration of Mattson and Simpson. I also recently met Michal Kosinski, the coauthor of one of the articles under discussion, and that was helpful too.

You need 16 times the sample size to estimate an interaction than to estimate a main effect

Yesterday I shared the following exam question:

In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?

Here’s the solution:

If you have 80% power, then the underlying effect size for the main effect is 2.8 standard errors from zero. That is, the z-score has a mean of 2.8 and standard deviation of 1, and there’s an 80% chance that the z-score exceeds 1.96 (in R, pnorm(2.8, 1.96, 1) = 0.8).

Now to the interaction. The standard error of an interaction is roughly twice the standard error of the main effect, as we can see from some simple algebra:
– The estimate of the main effect is ybar_1 – ybar_2, which has standard error sqrt(sigma^2/(N/2) + sigma^2/(N/2)) = 2*sigma/sqrt(N); for simplicity I’m assuming a constant variance within groups, which will typically be a good approximation for binary data, for example.
– The estimate of the interaction is (ybar_1 – ybar_2) – (ybar_3 – ybar_4), which has standard error sqrt(sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4)) = 4*sigma/sqrt(N). [algebra fixed]

And, from the statement of the problem, we’ve assumed the interaction is half the size of the main effect. So if the main effect is 2.8 on some scale with a se of 1, then the interaction is 1.4 with an se of 2, thus the z-score of the interaction has a mean of 0.7 and a sd of 1, and the probability of seeing a statistically significant effect difference is pnorm(0.7, 1.96, 1) = 0.10. That’s right: if you have 80% power to estimate the main effect, you have 10% power to estimate the interaction.
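
In R, the power calculation is just this (ignoring the tiny probability of a statistically significant result in the wrong direction):

# z-score has mean 2.8 (main effect) or 0.7 (interaction), sd 1 in both cases.
power_main <- 1 - pnorm(1.96, mean = 2.8, sd = 1)   # about 0.80
power_int  <- 1 - pnorm(1.96, mean = 0.7, sd = 1)   # about 0.10
c(power_main, power_int)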

And 10% power is really bad. It’s worse than it looks. 10% power kinda looks like it might be OK; after all, it still represents a 10% chance of a win. But that’s not right at all: if you do get “statistical significance” in that case, your estimate is a huge overestimate:

> raw <- rnorm(1e6, .7, 1)
> significant <- raw > 1.96
> mean(raw[significant])
[1] 2.4

So, the 10% of results which do appear to be statistically significant give an estimate of 2.4, on average, which is over 3 times higher than the true effect.

Take-home point

The most important point here, though, has nothing to do with statistical significance. It’s just this: Based on some reasonable assumptions regarding main effects and interactions, you need 16 times the sample size to estimate an interaction than to estimate a main effect.

And this implies a major, major problem with the usual plan of designing a study with a focus on the main effect, maybe even preregistering, and then looking to see what shows up in the interactions. Or, even worse, designing a study, not finding the anticipated main effect, and then using the interactions to bail you out. The problem is not just that this sort of analysis is “exploratory”; it’s that these data are a lot noisier than you realize, so what you think of as interesting exploratory findings could be just a bunch of noise.

I don’t know if all this is in the textbooks, but it should be.

Some regression simulations in R

In response to a comment I did some simulations which I thought were worth including in the main post.
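
Here’s a minimal sketch of the kind of simulation involved (not the full set of simulations), checking that the interaction’s standard error is about twice the main effect’s, hence 4 times the sample size for the same precision, and 16 times once the interaction is also half the size:

# One treatment x, one binary covariate z, balanced design, made-up effect sizes.
set.seed(123)
N <- 1e5
x <- rep(c(0, 1), N / 2)                 # treatment indicator
z <- rep(c(0, 0, 1, 1), N / 4)           # covariate (e.g., women = 0, men = 1)
y <- 0.2 * x + 0.1 * x * z + rnorm(N)    # main effect 0.2, interaction 0.1
se_main <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
se_int  <- summary(lm(y ~ x * z))$coefficients["x:z", "Std. Error"]
se_int / se_main                         # close to 2, i.e., 4 times the variance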

Here’s the title of my talk at the New York R conference, 20 Apr 2018:

The intersection of Graphics and Bayes, a slice of the Venn diagram that’s a lot more crowded than you might realize

And here are some relevant papers:

And here’s the conference website.

Classical hypothesis testing is really really hard

This one surprised me. I included the following question in an exam:

In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?

None of the students got any part of this question correct.

In retrospect, the question was too difficult; it had too many parts given that it was an in-class exam, and I can see how it would be tough to figure out all these numbers. But the students didn’t even get close: they had no idea how to start. They had no sense that you can work backward from power to effect size and go from there.

And these were statistics Ph.D. students. OK, they’re still students and they have time to learn. But this experience reminds me, once again, that classical hypothesis testing is really really hard. All these null hypotheses and type 1 and type 2 errors are distractions, and it’s hard to keep your eye on the ball.

I like the above exam question. I’ll put it in our new book, but I’ll need to break it up into many pieces to make it more doable.

P.S. See here for an awesome joke-but-not-really-a-joke solution from an anonymous commenter.

P.P.S. Solution is here.

Reasons for an optimistic take on science: there are not “growing problems with research and publication practices.” Rather, there have been, and continue to be, huge problems with research and publication practices, but we’ve made progress in recognizing these problems.

Javier Benitez points us to an article by Daniele Fanelli, “Is science really facing a reproducibility crisis, and do we need it to?”, published in the Proceedings of the National Academy of Sciences, which begins:

Efforts to improve the reproducibility and integrity of science are typically justified by a narrative of crisis, according to which most published results are unreliable due to growing problems with research and publication practices. This article provides an overview of recent evidence suggesting that this narrative is mistaken, and argues that a narrative of epochal changes and empowerment of scientists would be more accurate, inspiring, and compelling.

My reaction:

Kind of amusing that this was published in the same journal that published the papers on himmicanes, air rage (see also here), and ages ending in 9 (see also here).

But, sure, I agree that there may not be “growing problems with research and publication practices.” There were huge problems with research and publication practices; these problems remain, but there may be some improvement (I hope there is!). What’s happened in recent years is that there’s been a growing recognition of these huge problems.

So, yeah, I’m ok with an optimistic take. Recent ideas in statistical understanding have represented epochal changes in how we think about quantitative science, and blogging and post-publication review represent a new empowerment of scientists. And PNAS itself now admits fallibility in a way that it didn’t before.

To put it another way: It’s not that we’re in the midst of a new epidemic. Rather, there’s been an epidemic raging for a long time, and we’re in the midst of an exciting period where the epidemic has been recognized for what it was, and there are some potential solutions.

The solutions aren’t easy—they don’t just involve new statistics, they primarily involve more careful data collection and a closer connection between data and theory, and both these steps are hard work—but they can lead us out of this mess.

P.S. I disagree with the above-linked article on one point, in that I do think that science is undergoing a reproducibility crisis, and I do think this is a pervasive problem. But I agree that it’s probably not a growing problem. What’s growing is our awareness of the problem, and that’s a key part of the solution, to recognize that we do have a problem and to beware of complacency.

P.P.S. Since posting this I came across a recent article by Nelson, Simmons, and Simonsohn (2018), “Psychology’s Renaissance,” that makes many of the above points. Communication is difficult, though, because nobody cites anybody else. Fanelli doesn’t cite Nelson et al.; Nelson et al. don’t cite my own papers on forking paths, type M errors, and “the winds have changed” (which covers much of the ground of their paper); and I hadn’t been aware of Nelson et al.’s paper until just now, when I happened to run across it in an unrelated search. One advantage of the blog is that we can add relevant references as we hear of them, or in comments.

I fear that many people are drawing the wrong lessons from the Wansink saga, focusing on procedural issues such as “p-hacking” rather than scientifically more important concerns about empty theory and hopelessly noisy data. If your theory is weak and your data are noisy, all the preregistration in the world won’t save you.

Someone pointed me to this news article by Tim Schwab, “Brian Wansink: Data Masseur, Media Villain, Emblem of a Thornier Problem.” Schwab writes:

If you look into the archives of your favorite journalism outlet, there’s a good chance you’ll find stories about Cornell’s “Food Psychology and Consumer Behavior” lab, led by marketing researcher Brian Wansink. For years, his click-bait findings on food consumption have received sensational media attention . . .

In the last year, however, Wansink has gone from media darling to media villain. Some of the same news outlets that, for years, uncritically reported his research findings are now breathlessly reporting on Wansink’s sensational scientific misdeeds. . . .

So far, that’s an accurate description.

Wansink’s work was taken at face value by major media. Concerns about Brian Wansink’s claims and research methods had been known for years, but these concerns had been drowned out by the positive publicity—much of it coming directly from Wansink’s lab, which had its own publicity machine.

Then, a couple years ago, word got out that Wansink’s research wasn’t what it had been claimed to be. It started with some close looks at Wansink’s papers which revealed lots of examples of iffy data manipulation: you couldn’t really believe what was written in the published papers, and it was not clear what had actually been done in the research. The story continued when outsiders Tim van der Zee​, Jordan Anaya​, and Nicholas Brown found over 150 errors in four of Wansink’s published papers, and Wansink followed up by acting as if there was no problem at all. After that, people found lots more inconsistencies in lots more of Wansink’s papers.

This all happened as of spring, 2017.

News moves slowly.

It took almost another year for all these problems to hit the news, via some investigative reporting by Stephanie Lee of Buzzfeed.

The investigative reporting was excellent, but really it shouldn’t’ve been needed. Errors had been found in dozens of Wansink’s papers, and he and his lab had demonstrated a consistent pattern of bobbing and weaving, not facing these problems but trying to drown them in happy talk.

So, again, Schwab’s summary above is accurate: Wansink was a big shot, loved by the news media, and then they finally caught on to what was happening, and he indeed “has gone from media darling to media villain.”

But then Schwab goes off the rails. It starts with a misunderstanding of what went wrong with Wansink’s research.

Here’s Schwab:

His misdeeds include self-plagiarism — publishing papers that contain passages he previously published — and very sloppy data reporting. His chief misdeed, however, concerns his apparent mining and massaging of data — essentially squeezing his studies until they showed results that were “statistically significant,” the almighty threshold for publication of scientific research.

No. As I wrote a couple weeks ago, I fear that many people are drawing the wrong lessons from the Wansink saga, focusing on procedural issues such as “p-hacking” rather than scientifically more important concerns about empty theory and hopelessly noisy data. If your theory is weak and your data are noisy, all the preregistration in the world won’t save you.

To speak of “apparent mining and massaging of data” is to understate the problem and to miss the point. Remember those 150 errors in those four papers, and how that was just the tip of the iceberg? The problem is not that data were “mined” or “massaged,” the problem is that the published articles are full of statements that are simply not true. In several of the cases, it’s not clear where the data are, or what the data ever were. There’s the study of elementary school children who were really preschoolers, the pizza data that don’t add up, the carrot data that don’t add up, the impossible age distribution of World War II veterans, the impossible distribution of comfort ratings, the suspicious distribution of last digits (see here for several of these examples).

Schwab continues:

And yet, not all scientists are sure his misdeeds are so unique. Some degree of data massaging is thought to be highly prevalent in science, and understandably so; it has long been tacitly encouraged by research institutions and academic journals.

No. Research institutions and academic journals do not, tacitly or otherwise, encourage people to report data that never happened. What is true is that research institutions and academic journals rarely check to see if data are reasonable or consistent. That’s why it is so helpful that van der Zee, Anaya, and Brown were able to use statistical tools to check for certain obvious data errors, of which Wansink’s papers had many.
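
As a toy example of the kind of consistency check involved (in the spirit of the GRIM test; I’m not claiming this is the exact tool used here): if individual responses are integers, a reported mean is only arithmetically possible if some integer total divided by the sample size rounds to it.

# Toy GRIM-style check, assuming integer responses and a mean reported to 2 decimals.
grim_possible <- function(mean_reported, n, digits = 2) {
  k <- c(floor(mean_reported * n), ceiling(mean_reported * n))  # candidate integer totals
  any(round(k / n, digits) == round(mean_reported, digits))
}
grim_possible(3.44, 25)   # TRUE: 86/25 = 3.44 exactly
grim_possible(5.19, 28)   # FALSE: no integer total of 28 responses gives a mean of 5.19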

Schwab writes:

I wonder if we’d all be a little less scandalized by Wansink’s story if we always approached science as something other than sacrosanct, if we subjected science to scrutiny at all times, not simply when prevailing opinion makes it fashionable.

That’s a good point. I think Schwab is going too easy on Wansink—I really do think it’s scandalous when a prominent researcher publishes dozens of papers that are purportedly empirical but are consistent with no possible data. But I agree with him that we should be subjecting science to scrutiny at all times.

P.S. In his article Schwab also mentions power-pose researcher Amy Cuddy. I won’t get into this except to say that I think he should also mention Dana Carney—she’s the person who actually led the power-pose study and she’s also the person who bravely subjected her own work to criticism—and Eva Ranehill, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, Roberto Weber, and Anna Dreber, who did the careful replication study that led to the current skeptical view of the original power pose claims. I think that one of the big problems with science journalism is that researchers who make splashy claims get tons of publicity, while researchers who are more careful don’t get mentioned.

I think Schwab’s right that the whole Wansink story is unfortunate: First he got too much positive publicity, now he’s getting too much negative publicity. The negative publicity is deserved—at almost any time during the past several years, Wansink could’ve defused much of this story by simply sharing his data and being open about his research methods, but instead he repeatedly attempted to paper over the cracks—but it personalizes the story of scientific misconduct in a way that can be a distraction from larger issues of scientists being sloppy at best and dishonest at worst with their data.

I don’t know the solution here. On one hand, here Schwab and I are as part of the problem—we’re both using the Wansink story to say that Wansink is a distraction from the larger issues. On the other hand, if we don’t write about Wansink, we’re ceding the ground to him, and people like him, who unscrupulously seek and obtain publicity for what is, ultimately, pseudoscience. It would’ve been better if some quiet criticisms had been enough to get Brian Wansink and his employers to clean up their act, but it didn’t work that way. Schwab questions Stephanie Lee’s journalistic efforts that led to smoking-gun-style emails—but it seems like that’s what it took to get the larger world to listen.

Let’s follow Schwab’s goal of “subjecting science to scrutiny at all times”—and let’s celebrate the work of van der Zee, Anaya, Brown, and others who apply that scrutiny. And if it turns out that a professor at a prestigious university, who’s received millions of dollars from government and industry, has received massive publicity for purportedly empirical results that are not consistent with any possible data, then, yes, that’s worth reporting.

A more formal take on the multiverse

You’ve heard of multiverse analysis, which is an attempt to map out the garden of forking paths. Others are interested in this topic too. Carol Nickerson pointed me to this paper by Jan Wacker with a more formal version of the multiverse idea.

3 cool tricks about constituency service (Daniel O’Donnell and Nick O’Neill edition)

I’m a political scientist and have studied electoral politics and incumbency, but I’d not thought seriously about constituency service until a couple years ago, when I contacted some of our local city and state representatives about a nearby traffic intersection that seemed unsafe. I didn’t want any kids to get run over by drivers who could easily have been misled by the street design into taking the curve too fast.

It took a while, but after a few years, the intersection got fixed, thanks to assemblymember Daniel O’Donnell and his chief of staff Nick O’Neill.

This is pretty basic constituency service and you can bet that I’ll vote for O’Donnell for pretty much anything at this point, at least absent any relevant new information.

But the point of this post is not to endorse my state rep. Rather, I wanted to share the new perspective I’ve gained regarding constituency service.

Before now, I’d thought of constituency service as close to irrelevant to political performance. I mean, sure, it’s great if you can rescue some cat stuck up a tree or help untangle somebody’s paperwork, but the real job of a legislator is to help pass good laws, to stop bad laws from passing, and to exercise oversight on the executive and the judiciary.

But after this O’Donnell thing I have a different view. For one thing, I contacted several officeholders, and he was the only one to act. This action signals to me that he thinks that the safety of kids crossing the street is more important than the hassle of getting the Department of Transportation to make a change. This is actually a big deal, not just in itself but in having a local politician who’s not afraid of the DOT.

More generally, one can view constituency service on issue X as a sign that the politician in question thinks issue X is worth going to some trouble for. Those other politicians who didn’t respond to the request regarding the dangerous street (not even to give a reasoned response, perhaps convincing me that the intersection was actually safe, contrary to appearances)? I’m not so thrilled with their priorities.

I’m not saying that constituency service is a perfect signal; of course it’s just one piece of information. My point is that constituency service conveys more information than I’d realized: it’s not just about the legislator or someone in his office being energetic or a nice guy; it also tells us something about his priorities. In this case, I don’t see Daniel O’Donnell’s help on this as a way for him to get a vote or even as a way for him to quiet a squeaky wheel. Rather, I see it as him taking an opportunity to make the city a little bit of a better place, using my letter as a motivation to do something he would’ve wanted to do anyway. We work on systemic problems, and we also fix things one at a time when we can.

Murray Davis on learning from stories

Jay Livingston writes:

Your recent post and the linked article on storytelling reminded me of Murray Davis’s article on theory, which has some of the same themes. I haven’t reread it in a long time, so my memory of the details is hazy. Here are the first two paragraphs, which might give you an idea of what the remaining 15,000 words contains.

It has long been thought that a theorist is considered great because his theories are true, but this is false. A theorist is considered great, not because his theories are true, but because they are interesting. Those who carefully and exhaustively verify trivial theories are soon forgotten, whereas those who cursorily and expediently verify interesting theories are long remembered. In fact, the truth of a theory has very little to do with its impact, for a theory can continue to be found interesting even though its truth is disputed — even refuted!

Since this capacity to stimulate interest is a necessary if not sufficient characteristic of greatness, then any study of theorists who are considered great must begin by examining why their theories are considered interesting — why, in other words, the theorist is worth studying at all. But before we can attempt even this preliminary task we must understand clearly why some theories are considered interesting while others are not. In this essay, I will try to determine what it means for a theory to be considered interesting (or, in the extreme, fascinating).

That’s Interesting! Towards a Phenomenology of Sociology and a Sociology of Phenomenology
By Murray S. Davis
Phil. Soc. Sci. 1 (1971), 309-344

A quick search found this copy of Davis’s article online. I agree that these ideas overlap with those of Basbøll and me; Davis just has a different focus, as he’s engaging with the literatures in philosophy and sociology, whereas we come at the problem from a philosophy-of-science and literary perspective.

Also interesting is this statement from Davis:

I [Davis] contend that the ‘generation’ of interesting theories ought to be the object of as much attention as the ‘verification’ of insipid ones. [Emphasis in the original.]

Nowadays we wouldn’t talk of “verification” of a theory (even though lots of people in the “Psychological Science” or “PPNAS” world seem to think that way). And, indeed, I’m concerned less about “insipid” theories than about exciting-sounding theories (shark attacks swing elections! beautiful people have more daughters! Cornell students have ESP! himmicanes!) that don’t make a lot of sense and aren’t supported by the data. That all said, I agree that the generation of theories is not well understood and that this is a topic that deserves further study.

Hey, could somebody please send me a photo of a cat reading a Raymond Carver story?

Thanks in advance!

P.S. Jaime Ashander sent in a photo. Thanks, Jaime!

Incorporating Bayes factor into my understanding of scientific information and the replication crisis

I was having this discussion with Dan Kahan, who was arguing that my ideas about type M and type S error, while mathematically correct, represent a bit of a dead end in that, if you want to evaluate statistically-based scientific claims, you’re better off simply using likelihood ratios or Bayes factors. Kahan would like to use the likelihood ratio to summarize the information from a study and then go from there. The problem with type M and type S errors is that, to determine these, you need some prior values for the unknown parameters in the problem.

I have a lot of problems with how Bayes factors are presented in textbooks and articles by various leading Bayesians, but I have nothing against Bayes factors in theory.

So I thought it might help for me to explain, using an example, how I’d use Bayes factors in a scenario where one could also use type M and type S errors.

The example is the beauty-and-sex-ratio study described here, and the point is that the data are really weak (not a power=.06 study but a power=.0501 study or something like that). The likelihood for the parameter is something like normal(.08, .03^2)–that is, there’s a point estimate of 0.08 (an 8 percentage point difference in Pr(girl birth), comparing children of beautiful parents to others) with a se of 0.03 (that is, 3 percentage points). From the literature and some math reasoning (not shown here) having to do with measurement error in the predictor, reasonable effect sizes are anywhere between 0 and, say, +/- 0.001 (one-tenth of a percentage point); see the above-linked paper.

The relevant Bayes factor here is not theta=0 vs theta!=0. Rather, it’s theta=-0.001 (say) vs. theta=0 vs. theta=+0.001. Result will show Bayes factors very close to 1 (i.e., essentially zero evidence); also relevant is the frequentist calculation of how variable the Bayes factors might be under the null hypothesis that theta=0.

I better clarify that last point: The null hypothesis is not scientifically interesting, nor do I learn anything useful about sex ratios from learning that the p-value of the data relative to the null hypothesis is 0.20, or 0.02, or 0.002, or whatever. However, the null hypothesis can be useful as a device for approximating the sampling distribution of a statistical procedure.
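
Here’s a minimal R sketch of both calculations, using the numbers above (the particular plausible effect sizes are placeholders):

# Likelihood for theta is approximately normal(0.08, 0.03^2).
theta_vals <- c(-0.001, 0, 0.001)
lik <- dnorm(0.08, mean = theta_vals, sd = 0.03)
lik / lik[2]                  # Bayes factors relative to theta = 0: all close to 1

# How variable is the BF for +0.001 vs. 0 across replications if theta really is 0?
est <- rnorm(1e4, 0, 0.03)    # replicated point estimates under the null
bf  <- dnorm(est, 0.001, 0.03) / dnorm(est, 0, 0.03)
quantile(bf, c(0.025, 0.5, 0.975))   # hugs 1: the data can't distinguish these values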

P.S. See here for more from Kahan.

“and, indeed, that my study is consistent with X having a negative effect on Y.”

David Allison shares this article:

Pediatrics: letter to the editor – Metformin for Obesity in Prepubertal and Pubertal Children: A Randomized Controlled Trial

and the authors’ reply:

RE: Clarification of statistical interpretation in metformin trial paper

The authors of the original paper were polite in their response, but they didn’t seem to get the point of the criticism they were purportedly responding to.

Let’s step back a moment

Forget about the details of this paper, Allison’s criticism, and the authors’ reply.

Instead let’s ask a more basic question: How does one respond to scientific criticism?

It’s my impression that, something like 99% of the time, authors’ responses to criticism are predicated on the assumption that they were completely correct all along: the idea is that criticism is something to be managed. Tactical issues arise—Should the authors sidestep the criticism or face it head on? Should they be angry, hurt, dismissive, deferential, or equanimous?—but the starting point is the expectation of zero changes in the original claims.

That’s a problem. We all make mistakes. The way we move forward is by learning from our mistakes. Not from denying them.

Here was my response to Allison: you think that’s bad; check out this journal-editor horror story. These people are actively lying.

Admitting and learning from our errors

Allison responded:

We (meaning the scientific community in its broadest form) definitely have a long way to go in learning how to adhere scrupulously to truthfulness, to give and respond to criticism constructively and civilly, and how to admit mistakes and correct them.

I like this line from Eric Church: “And when you’re wrong, you should just say so; I learned that from a three year old.”

I wish more people would be willing to say:

You’re right. I made a mistake. My study does not show that X causes Y. I may still believe that X causes Y, but I acknowledge that my study does not show it.

We do occasionally get folks to write that in response to our comments, but it is all too rare.

Anyway, right now I have been looking at papers that make unjustified causal inferences because of neglecting (or not realizing) the phenomenon of regression to the mean. Regression to the mean really seems to confuse people.

And I replied: You write:

I wish more people would be willing to say:

You’re right. I made a mistake. My study does not show that X causes Y. I may still believe that X causes Y, but I acknowledge that my study does not show it.

I’d continue with, “and, indeed, that my study is consistent with X having a negative effect on Y. Or, more generally, having an effect that varies by context and is sometimes positive and sometimes negative.”

Also, I think that the causal discussion can mislead, in that almost all these issues arise with purely correlational studies. For example, the silly claim that beautiful parents are more likely to have daughters. Forget about causality; the real point is that there’s no evidence supporting the idea that there is such a correlation in the population. There’s a tendency of people to jump from the “stylized fact” to the purported causal explanation, without recognizing that there’s no good evidence for the stylized fact.
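
And since regression to the mean comes up in Allison’s note above and really does seem to confuse people, here’s a minimal simulated example, with entirely made-up numbers, of how it can masquerade as a treatment effect:

# No intervention at all, but subjects selected for extreme baseline values
# appear to "improve" at follow-up.
set.seed(2)
true_level <- rnorm(1e5, 100, 10)                # stable underlying level
baseline   <- true_level + rnorm(1e5, 0, 10)     # noisy measurement 1
followup   <- true_level + rnorm(1e5, 0, 10)     # noisy measurement 2
selected   <- baseline > 120                     # "enroll" only the high scorers
mean(baseline[selected]) - mean(followup[selected])   # roughly 13 points of pure noise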

Another reason not to believe the Electoral Integrity Project

Nick Stevenson writes:

I wonder if the Electoral Integrity Project still wants to defend Rwanda’s score of 64? Or is the U.S. (electoral integrity score 61) just jealous?

Stevenson was reacting to a news article from the Washington Post (sorry, the link no longer works) that reported:

The United States said Saturday it was “disturbed by irregularities observed during voting” in Rwanda’s election, which longtime President Paul Kagame won with nearly 99 percent of the vote.

A State Department statement reiterated “long-standing concerns over the integrity of the vote-tabulation process.”

Last time we heard about the Electoral Integrity Project, it was in the context of their claims that North Carolina is no longer a democracy but North Korea isn’t so bad (see also this response by Pippa Norris).

I responded that this Rwanda thing does seem to represent a problem with the international measure, similar to what happened with North Korea. Perhaps the measures are implicitly on a relative scale, so that Rwanda = 64 because Rwanda is about as bad as one might expect given its reputation, while U.S. = 61 because the U.S. is worse than one might hope, given its reputation?

Stevenson replied:

I wouldn’t say it’s as bad as North Korea, which is obviously a zero by any reasonable metric.

Kagame does have some defenders—see this article from today by Melina Platas in the Monkey Cage which notes that Kagame (like Putin) is domestically popular—accompanied by some rather eye-popping concessions:

Are some Rwandans intimidated by the state? Certainly. Does the ruling party have roots down to the lowest level? Definitely. Do opposition candidates have far fewer resources? Undeniably. Are some of those who wish to run for president unable to? Yes.

But I don’t think the author of this piece would maintain that Rwanda’s elections were freer and fairer than the USA’s.

This exchange happened in Aug 2016. I contacted Norris who said that the data would be available in February/March 2018. So anyone who’s interested should be able to go to the data soon and try to figure out what went wrong with the Rwanda survey.

P.S. The enumeration in the blog, of certain errors, shall not be construed to deny or disparage other work done by these researchers.

Important statistical theory research project! Perfect for the stat grad students (or ambitious undergrads) out there.

Hey kids! Time to think about writing that statistics Ph.D. thesis.

It would be great to write something on a cool applied project, but: (a) you might not be connected to a cool applied project, and you typically can’t do these on your own, you need collaborators who know what they’re doing and who care about getting the right answer; and (b) you’re in your doctoral program learning all this theory, so now’s the time to really learn that theory, by using it!

So here we are at Statistical Modeling, Causal Inference, and Social Science to help you out. Yes, that’s right, we have a thesis topic for you!

The basic idea is here, a post that was written several months ago but just happened to appear this morning. Here’s what’s going on: In various areas of the human sciences, it’s been popular to hypothesize, or apparently experimentally prove, that all sorts of seemingly trivial interventions can have large effects. You’ve all heard of the notorious claim, unsupported by data, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” but that’s just one of many many examples. We’ve also been told that whether a hurricane has a boy or a girl name has huge effects on evacuation behavior; we’ve been told that male college students with fat or thin arms have different attitudes toward economic redistribution, with that difference depending crucially on the socioeconomic status of their parents; we’ve been told that women’s voting behavior varies by a huge amount based on the time of the month, with that difference depending crucially on their relationship status; we’ve been told that political and social attitudes and behavior can be shifted in consistent ways by shark attacks and college football games and subliminal smiley faces and chance encounters with strangers on the street and, ummm, being exposed to “an incidental black and white visual contrast.” You get the idea.

But that’s just silly science, it’s not a Ph.D. thesis topic in statistical theory—yet.

Here’s where the theory comes in. I’ve written about the piranha problem, that these large and consistent effects can’t all, or even mostly, be happening. The problem is that they would interfere with each other: On one hand, you can’t have dozens of large and consistent main effects or else it would be possible to push people’s opinions and behavior to ridiculously implausible lengths just by applying several stimuli in sequence (for example, football game plus shark attack plus fat arms plus an encounter on the street). On the other hand, once you allow these effects to have interactions, it becomes less possible for them to be detected in any generalizable way in an experiment. (For example, the names of the hurricanes could be correlated with recent football games, shark attacks, etc.)

We had some discussion of this idea in the comment thread (that’s where I got off the quip, “Yes, in the linked article, Dijksterhuis writes, ‘The idea that merely being exposed to something that may then exert some kind of influence is not nearly as mystifying now as it was twenty years ago.’ But the thing he doesn’t seem to realize is that, as Euclid might put it, there are an infinite number of primes…”). What I’m thinking would really make the point clear would be to demonstrate it theoretically, using some sort of probability model (or, more generally, a mathematical model) of effects and interactions.

A proof of the piranha principle, as it were. Some sort of asymptotic result as the number of potential effects increases. I really like this idea: it makes sense, it seems amenable to theoretical study, it could be modeled in various different ways, it’s important for science and engineering (you’ll have the same issue when considering A/B tests for hundreds of potential interventions), and it’s not trivial, mathematically or statistically.

As always, I recommend starting with fake-data simulation to get an idea of what’s going on, then move to some theory.
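
For example, here’s the sort of back-of-the-envelope fake-data check I have in mind, with made-up numbers and using the variance-bound version of the argument:

# Suppose K independent binary factors each shift a standardized outcome by d sd.
set.seed(1)
K <- 30; d <- 0.5; n <- 1e4
X <- matrix(rbinom(n * K, 1, 0.5), n, K)   # fake binary exposures
signal <- drop(X %*% rep(d, K))            # combined main effects
var(signal)                                # about 1.9 (analytically K * d^2 / 4):
                                           # already more variance than a standardized
                                           # outcome has in total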

P.S. You might think: Hey, I’m reading this, but hundreds of other statistics Ph.D. students are reading this at the same time. What if all of them work on this one project? Then do I need to worry about getting “scooped”? The answer is, No, you don’t need to worry! First, hundreds of Ph.D. students might read this post, but only a few will pick this topic. Second, there’s a lot to do here! My first pass above is based on the normal distribution, but you could consider other distributions, also look not just at the distribution of underlying parameter values but at the distribution of estimates, you could embed the whole problem in a time series structure, you could look at varying treatment effects, there’s the whole issue of how to model interactions, there’s an entirely different approach based on hard bounds, all sorts of directions to go. And that’s not meant to intimidate you. No need to go in all these directions at once; rather, any of these directions will give you a great thesis project. And it will be different from everyone else’s on the topic. So get going, already! This stuff’s important, and we can use your analytical skills.