The other day I commented on a new Science News article by Tom Siegfried about statistics and remarked:

If there were a stat blogosphere like there’s an econ blogosphere, Siegfried’s article would’ve spurred a ping-ponging discussion, bouncing from blog to blog.

In response, various people pointed out to me in comments and emails that there *has* been a discussion on statistics blogs of this article; we just don’t have the critical mass of cross-linkages to maintain a chain reaction of discussion.

I’ll try my best to inaugurate a statistics-blogosphere symposium, though.

Before going on, though . . . Note to self: Publish an article in Science News. Tom Siegfried’s little news article got more reaction than just about anything I’ve ever written!

OK, on to the roundup, followed at the end by my latest thoughts (including a phrase in **bold**!).

– I didn’t really have much to say about the news article when it came out and in fact only posted on it because four different people emailed to ask my thoughts on it. I was already aware of the controversy surrounding the disconnect between statistics as used in the field and as taught in textbooks, and I thought Siegfried captured the issue pretty well (except for the discussion of Bayesian statistics, which promulgated some common misconceptions that I tried to correct in my blog entry).

– Dan Lakeland used the Science News article to compare actual scientific research (in its best form) to the cargo-cult version of scientific method presented in statistics textbooks (“null hypotheses,” “alternative hypotheses,” p-values, and the rest).

See also Lakeland’s follow-up, where he threw in a description of me that’s pretty accurate, considering that we’ve never met:

If you are a productive, professional, grant-funded scientist today, you are probably about 50 years old. You went to graduate school in the 1980’s. When you learned about statistics, computers were just about fast enough that they could sort of keep up with your typing speed over the 1200 baud modem that connected you to the university mainframe. The idea of running a 10000 iteration MCMC sampling scheme on a partially nested 4 level model with 1/2 million observations was something Andrew Gelman was maybe just dreaming about, and if he was trying it out he was certainly writing custom FORTRAN code to do it.

I indeed wrote custom Fortran code in my thesis! And I remember loving the 1200 baud modem. It was so, so much better than the 300 baud connection. But that was when I was in college. By the time I was in grad school we were using workstations.

– Real-life private-sector statistician Kaiser Fung slammed Siegfried for sensationalism. Kaiser says that, realistically, we’re never going to have absolute truth in our statistical analyses, but that the problem is not with p-values, significance levels, Bayes, or anything else like that, but just the nature of human knowledge: “False results are part of the process of scientific inquiry, not a sign of its failure.”

Or, as we tell the students when teaching sampling theory: In real life, sampling error is just a lower bound on uncertainty; nonsampling error is the most important problem. But as statisticians, we focus on sampling error because that’s *our* unique contribution to the endeavor. Your doctor helps with your health, your minister gives you religion, you get your music from WFUV, and your friendly neighborhood statistician computes your standard errors. It’s called division of labor, and criticizing statistics for not solving all your scientific problems makes no more sense than criticizing your rabbi for not curing your pneumonia or sadly concluding that your D.J.–despite his wit and excellent taste in music–can’t do anything useful about those rude drivers on your morning commute.

– James Annan agreed with Siegfried that p-values can mislead, but he doesn’t seem to feel that statistics as a whole is about to fall apart.

– A physics blogger called Tamino wrote: “the foundation [of statistics] is not flimsy, it’s solid as a rock. Statistics works, it does what it’s supposed to do. But it is susceptible to misinterpretation, to false results purely due to randomness, to bias, and of course to error. That’s what the ScienceNews article is really about, although it takes liberties (in my opinion) in order to sensationalize the issue. But hey, that’s what magazines (not peer-reviewed journals) do.” Well put, although I disagree with some of Tamino’s later statements on probability (more on this below).

– Tamino’s remarks are ultimately focused not so much on statistics but on applications in climate science, and he was responding to Anthony Watts, who welcomed Siegfried’s article for “pointing out an over-reliance on statistical methods can produce competing results from the same base data.” Watts also links to this fun page of statistics quotes, but I’m not at all impressed by this quote from Ernest Rutherford: “If your experiment needs statistics, you ought to have done a better experiment.” That’s just obnoxious. In the meantime, before you have the “better experiment,” you still might have to make some decisions.

– Physicist Lubos Motl used the Siegfried article as a springboard for a very reasonable discussion of the role of hypothesis testing in statistical reasoning. I was trained as a physicist myself, so maybe that’s one reason I’m comfortable with this way of thinking. Motl writes: “statistical methods have always been essential in any empirically based science. In the simplest situation, a theory predicts a quantity to be “P” and it is observed to be “O”. The idea is that if the theory is right, “O” equals “P”. In the real world, neither “O” nor “P” is known infinitely accurately. . . . if “O” and “P” are (much) further from one another than both errors of “O” as well as “P”, the theory is falsified. It’s proven wrong. If they’re close enough to one another, the theory may pass the test: we failed to disprove it. But as always in science, it doesn’t mean that the theory has been proven valid. Theories are never proven valid “permanently”. They’re only temporarily valid until a better, more accurate, newer, or more complete test finds a discrepancy and falsifies them.” This is a refreshing departure from naive and (to me) pointless discussions of “the probability the null hypothesis is true” (again, more on that below).
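Motl’s falsification logic can be sketched in a few lines. The function name and the two-standard-error cutoff below are my own illustrative choices, not anything from his post:

```python
import math

def consistent(observed, obs_err, predicted, pred_err, z_threshold=2.0):
    """Crude version of Motl's test: is the observation consistent
    with the theory's prediction, given both uncertainties?

    Combines the two errors in quadrature and checks whether the
    discrepancy exceeds z_threshold combined standard errors.
    (z_threshold=2.0 is an arbitrary illustrative cutoff.)
    """
    combined_err = math.hypot(obs_err, pred_err)
    z = abs(observed - predicted) / combined_err
    return z <= z_threshold

# A prediction of 10.0 +/- 0.3 versus a measurement of 10.4 +/- 0.4:
# discrepancy 0.4, combined error 0.5, z = 0.8 -- the theory survives.
print(consistent(10.4, 0.4, 10.0, 0.3))  # True
print(consistent(12.0, 0.4, 10.0, 0.3))  # False: z = 4, falsified
```

Note that passing the test only means “we failed to disprove it,” exactly as Motl says; it proves nothing permanently.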

Unfortunately, Motl went a bit too far for me when he starts talking about social and environmental science, saying that if effects are “claimed to be established at the 90% confidence level, it’s just an extremely poor evidence.” At a mathematical level, I know what he’s saying: 90% confidence is just 1.65 standard errors from zero, and that’s not far at all from a statistically insignificant 1 standard error from zero. Still, to go back to our earlier point (or to Phil’s discussion of inference for climate change), decisions do need to be made, and it’s best to summarize the inference we do have as best we can, even as we wait for better data and models.
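The arithmetic behind this point is easy to check under a normal approximation (the approximation is my assumption here):

```python
from statistics import NormalDist

# Two-sided confidence levels and how many standard errors from zero
# each corresponds to under a normal approximation:
# 90%: 1.64, 95%: 1.96, 99%: 2.58.
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf((1 + level) / 2)
    print(f"{level:.0%}: {z:.2f} standard errors")
```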

– Statistical consultant Mark Palko (who wrote, “I nearly stopped reading when I hit the phrase ‘mutant form of math’”) took the discussion in a different direction: “I [Palko] wonder if in an effort to make things as simple as possible, we haven’t actually made them simpler. . . . Letting everyone pick their own definition of significance is a bad idea but so is completely ignoring context. Does it make any sense to demand the same level of p-value from a study of a rare, slow-growing cancer (where five years is quick and a sample size of 20 is an achievement) and a drug to reduce BP in the moderately obese (where a course of treatment lasts two weeks and the streets are filled with potential test subjects)? Should we ignore a promising preliminary study because it comes in at 0.06?” This is a point that I’ve talked about on occasion in the political science context: There have been fewer than 20 presidential elections in modern (post-World War 2) politics, so, yes, demanding 95% confidence for inferences from such data seems to miss the point. There’s already more than a 1 in 20 chance, I think, of some sort of major change that would make your model irrelevant. On a related point, Palko asked, “In fields like econ where researchers often have to rely on natural experiments based on rare combinations of events, does it even make sense to discuss replication?”
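Palko’s context point can be made quantitative with a back-of-the-envelope power calculation. The effect size and sample sizes below are hypothetical numbers of my own choosing, meant only to contrast the rare-disease and plentiful-subjects settings:

```python
from statistics import NormalDist

def power_one_sample(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z test:
    the probability of reaching p < alpha when the true standardized
    effect is `effect_size` and the sample size is `n`.
    (Normal approximation; the tiny lower-tail term is ignored.)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_alpha - effect_size * n ** 0.5)

# The rare-cancer study (n = 20) versus the blood-pressure trial
# (n = 2000), both chasing the same modest standardized effect of 0.3:
print(round(power_one_sample(0.3, 20), 2))    # 0.27
print(round(power_one_sample(0.3, 2000), 2))  # 1.0
```

With n = 20, a real but modest effect will miss the 0.05 threshold nearly three times out of four, which is why throwing away a 0.06 in that setting seems so wasteful.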

– An engineer named William Connolley linked to Siegfried’s article and wrote, “much of science isn’t statistical at all. . . . the sciencenews thing itself seems to be mostly thinking about medicine, where they use stats a lot because they don’t know what is really going on.” Well, yes and no. Sometimes medical researchers know what is really going on and sometimes they don’t, but in either case there’s a lot of individual variation–people’s bodies are different–and so statistics can be helpful.

**My thoughts**

As I already noted, I thought Siegfried’s article was basically OK in that he was capturing some real discontent among users of statistics. Whether or not statistics has a firm foundation, many scientists certainly *feel* that there are fundamental problems, and it’s not really Siegfried’s job to take a stand here. As a reporter, he’s reporting what different people think. I mean, sure, he could’ve concluded from his interview with me that statistics has a firm foundation–but why should he have trusted me more than the various other people he interviewed? What if he had made the mistake of trusting someone who said that statistics is only for people who “don’t know what is really going on”?? Setting the rhetoric aside, and also setting aside a few technical mistakes (noted in my earlier blog entry; see the very first link above), I think Siegfried did a reasonable job of laying out the controversy.

My perspective on some of this is, I believe, similar to Feynman’s irritated reaction when people asked him if light is a particle, or a wave, or a “wavicle.” From his perspective, light is particles: yes, particles that go around corners, but particles nonetheless. Now, I certainly don’t want to get into a discussion of quantum physics here; my point is that I share Feynman’s annoyance with pseudo-deep philosophical discussions which might begin as attempts to explain tricky concepts but quickly become morasses in themselves.

That’s how I feel about this whole subjective Bayesian thing. When I set up, fit, check, and improve a “Bayesian” model, it’s no more subjective than when Brad Efron decides what “estimator” to use and what set of replications to “bootstrap” over, or when Neyman and Pearson decided what “probability law” to use, or when Savage decided what “loss function” to minimize or when Cox decided how to construct his “semiparametric” model, etc etc etc. It’s about what we do, and what information we use. I can see how Tukey could’ve gotten so fed up with all the theory and philosophy that he decided just to present some graphical methods and not specify where they came from. If it’s all about the method, just present the method. I don’t go that far–when it comes to statistics, I ultimately find modeling to be more flexible and effective than direct construction of algorithms–but I see the appeal of chucking it all. Especially after hearing one more time the same old B.S. about subjectivity and objectivity. (For perhaps my definitive take on the topic, see here.)

That said, the connection between statistical modeling and reality can be tricky. You have Larry Wasserman, who works with physicists and should know better, thinking that, in particle physics, 95 percent of published 95% intervals will actually contain the truth, while Lubos Motl, who is a physicist and actually does know better, reminding us that, no, our models are full of errors and we should be wary of the nominal probabilities that come out of our statistical estimation.
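A small simulation illustrates the gap between the idealized claim and the warning. The bias term standing in for “errors in the model” is my own toy choice:

```python
import random
from statistics import NormalDist

random.seed(1)
z95 = NormalDist().inv_cdf(0.975)  # ~1.96

def coverage(bias, n_reps=20000, n=25, sigma=1.0, mu=0.0):
    """Fraction of nominal 95% intervals that contain the true mean mu.
    `bias` is a systematic (non-sampling) error baked into every
    measurement; the interval construction doesn't know about it.
    """
    se = sigma / n ** 0.5
    hits = 0
    for _ in range(n_reps):
        # Draw the sample mean directly from its sampling distribution.
        xbar = random.gauss(mu + bias, se)
        if xbar - z95 * se <= mu <= xbar + z95 * se:
            hits += 1
    return hits / n_reps

print(round(coverage(bias=0.0), 2))  # ~0.95: the idealized case works
print(round(coverage(bias=0.5), 2))  # well below 0.95 once the model is wrong
```

When the model is right, the advertised 95% coverage is exactly what you get; a systematic error of half a unit, invisible to the interval-maker, quietly destroys it.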

**Please don’t talk to me about the Pr (null hypothesis)**

I agree with Tom Siegfried, Don Rubin, and the many many others who have criticized p-values–whatever their performance might be in theory–for being routinely misunderstood in practice. A p-value is the probability of seeing something as extreme as was observed, if the model were true. It is *not* under any circumstances a measure of the probability that the model is true.
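The distinction can be demonstrated in a few lines: simulate studies in which the null model really is true and watch the p-values behave exactly as advertised, which says nothing at all about the probability of any model. The z test with known sigma is a simplifying assumption of mine:

```python
import random
from statistics import NormalDist

random.seed(42)

def p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a z test of H0: mean == mu0, sigma known:
    the probability, *if the null model were true*, of a sample mean at
    least this far from mu0. It says nothing about Pr(H0 | data).
    """
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Simulate 5000 studies in which the null model really is true:
# the p-values come out uniform on (0, 1), so about 5% fall below 0.05.
ps = [p_value([random.gauss(0, 1) for _ in range(30)]) for _ in range(5000)]
print(round(sum(p < 0.05 for p in ps) / len(ps), 2))  # ~0.05
```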

The logical next step, **which I hate hate hate hate hate**, is to then try to calculate the probability that the null hypothesis is true. No. I refuse to do this. As a statistician, I am generally supportive of “give the people what they want” sorts of arguments, but this time I say no. In all the settings I’ve ever worked on, the probability that the model is true is . . . zero! I prefer Bayesian inference (or, more generally, interval estimation) for quantitative parameters and graphical checks (or, on occasion, p-values) to summarize ways in which the model doesn’t fit reality. Lots of problems are caused, I believe, by the often unquestioned idea that we should want to calculate, or estimate, the probability of a model being true. See here for more more more on this topic.

P.S. I still think the econ blogosphere has us beat: no matter how hard I try, I can’t capture the drama of the Krugmeister battling the freshwaterites, with Cowen, Tabarrok, et al. throwing fuel on the fire and Mark Thoma keeping score. And it’s also funny for this entire discussion to have been sparked by an innocuous if dramatically-phrased article in Science News. But, hey, we gotta start somewhere.

P.P.S. Actually, Alex T. did comment on Siegfried’s article, making an observation similar to Motl’s (noted above) that all the discussions of p-values shouldn’t obscure the importance of errors in the model.

Andrew: Great summary. The econosphere has us beat because we are too civil. That's probably due to our makeup: we have to be comfortable with gray to be a statistician; there's no way around it.

I cannot agree more with your Pr(null hypothesis) point. In addition to what you wrote, having that probability does not resolve the issue of the "arbitrary" 95%. The minute you convert the probability distribution into a point or interval estimate, we are back where we started.

As I said in my response, we shouldn't give him a pass. We may interpret Siegfried's subject as misapplication of statistics because we want to believe that's what he's talking about. But the words are clear: he's shooting for the mother lode.

Finally, William the engineer used "special relativity" as an example of "much of science that doesn't use statistics". I was trained as an engineer originally, and while I agree that engineering education does not touch stochastics till an advanced stage, it is not true that statistics don't inform engineering. I can name many areas where statistics are heavily used: queuing, telecoms networks, coding theory, call center operations, transportation, statistical mechanics, etc. The entire cottage industry around "six sigma" is also about statistics.

There was also a discussion of this article on Slashdot, where a few actual statisticians weighed in….

I'd like to weigh in on that engineering thing…

in Civil Engineering, as practiced, all the statistics have been taken out of the calculations by the researchers who decided what number to put in the table… But that decision itself can be thought of as an approximation of deciding on a probability threshold for safety and then finding a number deep enough into the tail of the distribution that you're pretty well assured of that level of safety.

In the research community though, we talk about things like how climate change might change the 100 year flood, or how to analyze the in-situ strength of a structure that has been through several earthquakes. There is no way to duck statistics in these kinds of calculations, but those are not calculations you will see in practicing engineers' offices. They will wait for the researchers to give them a new table of 100 year floods or formulas for residual strength… in which the statistics will be embodied.


"But as statisticians, we focus on sampling error because that's our unique contribution to the endeavor."

And also because non-sampling error is a lot harder problem.

I'm actually fairly happy to see diverse opinions and takes on a statistical issue. At the risk of being the local Pollyanna, this does seem to suggest a broad group of people with interesting thoughts, so who knows where it can all go?

:-)

In terms of sampling versus non-sampling error, I often work with large databases where sampling error has a minor contribution compared with non-sampling error (confounding and bias). The local solution seems to be sensitivity analysis (which has a set of issues of its own but at least gives a sense for how much changes in assumptions can change estimates of association).

But it would be cool to have a certainty index to measure "expected bias plus sampling error"; of course, one would then need a measure of overall unmeasured confounding (which I think is an insoluble problem).

So I agree with zbicylist that the non-sampling error problem is hard!

Andrew

Your critical remark aimed at me is unfair. You know I was discussing the mathematical definition of a confidence interval. By definition, 95 percent of confidence intervals will contain the true value. In practice, they might not, because of biased estimators etc. But I was remarking on the fact that even in the idealized case where they do have correct coverage, subjective Bayesians will dismiss the coverage as irrelevant. Please don't misquote me.

Thanks,

Larry

Larry:

First off, I'm not a "subjective Bayesian" and never have been, and I have no interest in defending their ideas. We probably agree 100% (or, at least 50% or 90%) on that.

Second, I certainly wouldn't want to misquote you and I don't think I did here. I'm a very careful quoter. Above, I wrote that you are thinking that, in particle physics, 95 percent of published 95% intervals will actually contain the truth. Here's the full quote from your 2008 paper:

This is not merely a discussion of a mathematical definition; it's a claim about what is happening in particle physics.

(Regarding our natural tendency to understand theoretical principles in light of our own particular applied experiences, see the second and third meta-principles of statistics discussed here.)

That quote was a thought experiment where I was trying to explain the consequences of giving up coverage.

Andrew – the quote was either out of context or Larry was a bit careless when he wrote this – he knew better even when he was a graduate student.

Liked this excerpt from your "here" link:

"But sometimes that is because we are careful in deciding what part of the model is “the likelihood.” Nowadays, this is starting to have real practical consequences even in Bayesian inference, … depend crucially on how the model is partitioned into likelihood, prior, and hyperprior distributions."

Keith

Keith: Larry's quote came from his discussion of my Bayesian Analysis article in 2008. I have definitely not taken it out of context. It didn't look like a thought experiment either ("The particle physicists have left a trail of such confidence intervals in their wake. . . .") but I'll take Larry's word for that.

Larry, are you maybe thinking about the back-and-forth you had with Radford Neal? There, your very first comment distinguishes the concept of coverage in an ideal world from the performance of confidence intervals in the real world. The response to Andrew's article is missing this distinction. Unfortunately, you've left yourself open to being quoted as holding an unrealistic view of coverage, even though the discussion at Radford Neal's blog shows you never intended that.