Faye Flam wrote a solid article for the New York Times on Bayesian statistics, and as part of her research she spent some time on the phone with me a while ago discussing the connections between Bayesian inference and the crisis in science criticism. My longer thoughts on this topic are in my recent article, “The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective,” but of course many more people will get the short version that appeared in the newspaper.
That’s fine, and Flam captured the general “affect” of our discussion—the idea that Bayes allows the use of prior information, and that p-values can’t be taken at face value. As I discuss below, I like Flam’s article, I’m glad it’s out there, and I’m grateful that she took the time to get my perspective.
Unfortunately, though, some of the details got garbled.
Flam never put quotation marks around anything I said, and I know that with journalism there isn’t always time to check every paragraph. After I saw the article online, I pointed out the mistakes, and Flam asked the NYT editors to correct them, so I hope this will be done soon.
In the meantime, I’ll post the corrections here.
In the article, it says:
But there’s a danger in this [p-value] tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.
No no no no no. I recommended correcting as follows:
But there’s a danger in this tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent can lead to spurious findings—cases where an observed “statistically significant” pattern in data does not reflect a corresponding pattern in the population—far more than 5 percent of the time. The weaker the signal and the noisier the measurements, the more likely that a pattern, even if statistically significant, will not replicate.
To the outsider this might sound almost the same, but on a technical level it makes a big difference!
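The difference is easy to see in a quick simulation. Here is a minimal sketch (the numbers are illustrative assumptions, not from any real study): suppose half of all hypotheses studied are truly null, and the rest have a weak true effect of 0.1 standard deviations measured with n = 50 noisy observations. Among the results that clear the p &lt; 0.05 bar, far more than 5 percent turn out to be pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)

n_studies = 10_000
n_per_study = 50
true_effect = 0.1   # weak signal, in sd units -- an assumption for illustration
frac_null = 0.5     # half the hypotheses are truly null -- also an assumption

is_null = rng.random(n_studies) < frac_null
effects = np.where(is_null, 0.0, true_effect)

# Each study: the sample mean of n noisy measurements, tested with a
# two-sided z-test at the 5% level.
sample_means = rng.normal(effects, 1.0 / np.sqrt(n_per_study))
z = sample_means * np.sqrt(n_per_study)
significant = np.abs(z) > 1.96

# Among the "statistically significant" results, what share came from
# truly null hypotheses?
false_discovery_rate = np.mean(is_null[significant])
print(f"share of significant results that are noise: {false_discovery_rate:.0%}")
```

With these settings the power against the weak effect is only about 11%, so roughly 30% of the “significant” findings are noise, even though each individual test controls its error rate at 5%. Weaker signals or noisier measurements push that share higher still.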
The article then says that I say:
The proportion of wrong results published in prominent journals is probably even higher
I would change this to:
This could well be an even bigger problem with prominent journals
Later the article refers to the notorious fecundity-and-voting study and says:
Dr. Gelman re-evaluated the study using Bayesian statistics. That allowed him to look at probability not simply as a matter of results and sample sizes, but in the light of other information that could affect those results.
He factored in data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. When he did, the study’s statistical significance evaporated.
This is not correct. I did not re-evaluate the study using Bayesian methods, nor did I claim to have done so.
Here’s my suggested revision:
Dr. Gelman felt this result was not consistent with polling data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. And after accounting for the many different analyses that could have been performed on the data, the study’s statistical significance evaporated.
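The “many different analyses” point can be put in rough numbers. As a back-of-the-envelope sketch (treating the potential analyses as independent, which is a simplification, since analyses of the same dataset are correlated): if researchers could have run any of k plausible analyses, the chance that at least one of them crosses p &lt; 0.05 by luck alone grows quickly with k.

```python
# Probability of at least one "significant" result among k independent
# null tests at the 5% level: 1 - 0.95^k.
for k in (1, 5, 20):
    p_any = 1 - 0.95**k
    print(f"{k:2d} potential analyses -> P(at least one significant): {p_any:.0%}")
```

With 20 potential comparisons, chance alone delivers a “significant” result about 64% of the time, which is why accounting for the analyses that could have been performed can make an apparent finding evaporate.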
Finally, the article writes of me:
He suggests using Bayesian calculations not necessarily to replace classical statistics but to flag spurious results.
I wouldn’t quite put it that way! I prefer:
He says that in such studies there is strong prior information, which can be included using Bayesian methods or in other ways.
Putting it into perspective
I suppose journalists find it difficult to deal with academics because we’re so picky. As I noted above, I think the article captured the general sense of what I was saying, and overall I like it. I like how Flam quoted people who had varying perspectives; I think it’s important for people to see statistics as a pluralistic field with different tools for solving different problems.
But I do think the details matter (and I certainly don’t want people to think I said things I didn’t say, or that I did things I didn’t do) so I hope the corrections can be made soon. And I stand by the larger point that lots of bad stuff happens when people think that “statistically significant” + “vague theory” = truth. I can’t say that I’m surprised that Kristina Durante, the author of the fecundity-and-voting study, stands by those claims, but I think it’s too bad. The point is not that there’s anything horrible about Durante (a person whom I’ve never met), nor do I know of anything horrible about Daryl Bem, etc., but that there is widespread confusion about how to do statistics, especially when studying small effects in the presence of large measurement errors (that’s one of the things I discuss in my above-cited article), and I’m glad to get these concerns out there, as precisely as is possible within the format of a newspaper article.
In any case, this’ll be an excellent example for my statistical communication class!
P.S. I also just noticed this bit from the article:
The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.
By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you’ve previously seen your friend use a weighted coin.
No!!!!!!!!!!!!!! Weighting a coin does not (appreciably) affect the probability that a coin lands heads. You can load a die but you can’t bias a coin. Yes, with practice you can throw a coin (weighted or otherwise) to generally land heads or tails, but, no, there is no such thing as a weighted coin which has an appreciably greater than 50% chance of generally landing heads. No big deal but this is one of my pet peeves. Also, beyond the flaws in this particular example, I don’t think it’s a good representation of science, in that the point to me is not to distinguish fair from unfair coins (equivalently, to distinguish randomness from non-randomness) but rather to understand the many real patterns in the world, which are not purely random but can be buried in noise if we’re not careful, hence motivating noise-reduction efforts such as this, with Sharad Goel, David Rothschild, and Doug Rivers. (And my point there was not to promote that work but to illustrate my general point with an example.)
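Setting aside the physics, the arithmetic behind the quoted “about 1 percent” does check out, and the frequentist/Bayesian contrast the article is reaching for can be sketched in a few lines. The Bayesian half of this sketch uses a hypothetical biased-coin probability of 0.9 and a 10% prior on “biased”; those numbers are my own illustrative assumptions, not from the article.

```python
from math import comb

# Frequentist tail probability: chance of 9 or more heads in 10 tosses
# of a fair coin.
n, k = 10, 9
p_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(f"P(>= {k} heads | fair coin) = {p_tail:.4f}")  # 11/1024, about 1 percent

# A minimal Bayesian contrast: two hypotheses -- fair (p=0.5) vs. a
# hypothetical biased coin (p=0.9) -- with a 10% prior on "biased"
# standing in for "other relevant information".
prior_biased = 0.10
lik_fair = comb(n, k) * 0.5**k * 0.5 ** (n - k)
lik_biased = comb(n, k) * 0.9**k * 0.1 ** (n - k)
post_biased = (prior_biased * lik_biased) / (
    prior_biased * lik_biased + (1 - prior_biased) * lik_fair
)
print(f"P(biased | {k} of {n} heads) = {post_biased:.3f}")
```

The tail probability answers “how surprising is this if the coin is fair?” while the posterior answers “how probable is the biased-coin hypothesis?”; the two numbers are different quantities and can be far apart. None of which rescues the example: as noted above, you can’t actually bias a coin by weighting it.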