McCloskey et al. on significance testing in economics

Now that we’re on the topic of econometrics . . . somebody recommended to me a book by Deirdre McCloskey. I can’t remember who gave me this recommendation, but the name did ring a bell, and then I remembered I wrote some other things about her work a couple years ago. See here.

And, because not everyone likes to click through, here it all is again:

Scott Cunningham writes,

Today I was rereading Deirdre McCloskey and Ziliak’s JEL paper on statistical significance, and then reading for the first time their detailed response to a critic who challenged their original paper. I was wondering what opinion you had about this debate. Are statistical significance and Fisher tests of significance as maligned and problematic as McCloskey and Ziliak claim? In your professional opinion, what is the proper use of seeking to scientifically prove that a result is valid and important?

The relevant papers are:

McCloskey and Ziliak, “The Standard Error of Regressions,” Journal of Economic Literature 1996.

Ziliak and McCloskey, “Size Matters: The Standard Error of Regressions in the American Economic Review,” Journal of Socio-Economics 2004.

Hoover and Siegler, “Sound and Fury: McCloskey and Significance Testing in Economics,” Journal of Economic Methodology, 2008.

McCloskey and Ziliak, “Signifying Nothing: Reply to Hoover and Siegler.”

My comments:

1. I think that McCloskey and Ziliak, and also Hoover and Siegler, would agree with me that the null hypothesis of zero coefficient is essentially always false. (The paradigmatic example in economics is program evaluation, and I think that just about every program being seriously considered will have effects–positive for some people, negative for others–but not averaging to exactly zero in the population.) From this perspective, the point of hypothesis testing (or, for that matter, of confidence intervals) is not to assess the null hypothesis but to give a sense of the uncertainty in the inference. As Hoover and Siegler put it, “while the economic significance of the coefficient does not depend on the statistical significance, our certainty about the accuracy of the measurement surely does. . . . Significance tests, properly used, are a tool for the assessment of signal strength and not measures of economic significance.” Certainly, I’d rather see an estimate with an assessment of statistical significance than an estimate without such an assessment.
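
To make this concrete, here’s a minimal simulation (my own illustration, with an arbitrary small effect and noise level, not anything from the papers under discussion): once the true effect is anything other than exactly zero, the rejection rate of the usual test heads toward one as the sample grows, so in practice the test is telling you about signal strength relative to sampling noise rather than delivering a verdict on a sharp null.

    import numpy as np

    rng = np.random.default_rng(0)
    true_effect = 0.02   # small but nonzero "true" effect (arbitrary choice)
    sigma = 1.0          # noise level (arbitrary choice)

    for n in [100, 10_000, 250_000]:
        rejections = 0
        for _ in range(500):
            y = true_effect + sigma * rng.standard_normal(n)  # y_i = effect + noise
            se = y.std(ddof=1) / np.sqrt(n)
            z = y.mean() / se                                 # z-test of H0: effect = 0
            rejections += abs(z) > 1.96
        # The rejection rate climbs toward 1 as n grows, even though the effect is tiny.
        print(f"n = {n:>7}: rejection rate ~ {rejections / 500:.2f}")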

2. Hoover and Siegler’s discussion of the logic of significance tests (section 2.1) is standard but, I believe, wrong. They talk all about Type 1 and Type 2 errors, which are irrelevant for the reasons described in point 1 above.

3. I agree with most of Hoover and Siegler’s comments in their Section 2.4, in particular with the idea that the goal in statistical inference is often not to generalize from a sample to a specific population, but rather to learn about a hypothetical larger population, for example generalizing to other schools, other years, or whatever. Some of these concerns can best be handled using multilevel models, especially when considering different possible generalizations. This is most natural in time-series cross-sectional data (where you can generalize to new units, new time points, or both) but also arises in other settings. For example, in our analyses of electoral systems and redistricting plans, we were careful to set up the model so that our probability distribution generalized to other possible elections in existing congressional districts, not to hypothetical new districts drawn from a common population.

4. Hoover and Siegler’s Section 2.5, while again standard, is I think mistaken in ignoring Bayesian approaches, which limits their “specification search” approach to the two extremes of least squares or setting coefficients to zero. They write, “Additional data are an unqualified good thing, which never mislead.” I’m not sure if they’re being sarcastic here or serious, but if they’re being serious, I disagree. Data can indeed mislead on occasion.

Later Hoover and Siegler cite a theorem that states “as the sample size grows toward infinity and increasingly smaller test sizes are employed, the test battery will, with a probability approaching unity, select the correct specification from the set. . . . The theorem provides a deep justification for search methodologies . . . that emphasize rigorous testing of the statistical properties of the error terms.” I’m afraid I disagree again–not about the mathematics, but about the relevance, since, realistically, the correct specification is not in the set, and the specification that is closest to the ultimate population distribution should end up including everything. A sieve-like approach seems more reasonable to me, where more complex models are considered as the sample size increases. But then, as McCloskey and Ziliak point out, you’ll have to resort to substantive considerations to decide whether various terms are important enough to include in the model. Statistical significance or other purely data-based approaches won’t do the trick.
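
Here’s a minimal sketch of the sieve idea, with made-up data and an arbitrary growth rate for the model complexity: rather than fixing the specification in advance, let the complexity of the fitted model (here, the degree of a polynomial regression) grow slowly with the sample size.

    import numpy as np

    rng = np.random.default_rng(1)

    for n in [50, 500, 5_000]:
        x = rng.uniform(-2, 2, size=n)
        y = np.sin(x) + 0.3 * rng.standard_normal(n)   # made-up nonlinear "truth" plus noise
        degree = max(1, int(round(n ** 0.2)))          # let model complexity grow slowly with n
        coef = np.polyfit(x, y, deg=degree)
        resid_sd = np.std(y - np.polyval(coef, x))
        print(f"n = {n:>5}: fitted polynomial degree {degree}, residual sd {resid_sd:.3f}")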

Although I disagree with Hoover and Siegler in their concerns about Type 1 error etc., I do agree with them that it doesn’t pay to get too worked up about model selection and its distortion of results–at least in good analyses. I’m reminded of my own dictum that multiple comparisons adjustments can be important for bad analyses but are not so important when an appropriate model is fit. I agree with Hoover and Siegler that it’s worth putting in some effort in constructing a good model, and not worrying if said model was not specified before the data were seen.

5. Unfortunately my copy of McCloskey and Ziliak’s original article is not searchable, but if they really said, “all the usual econometric problems have been solved”–well, hey, that’s putting me out of a job, almost! Seriously, there are lots of statistical (thus, I assume econometric) problems that are still open, most notably in how to construct complex models on large datasets, as well as more specific technical issues such as adjustments for sample surveys and observational studies, diagnostics for missing-data imputations, models for time-series cross-sectional data, etc etc etc.

6. I’m not familiar enough with the economics to comment much on the examples, but the study of smoking seems pretty wacky to me. First there is a discussion of “rational addiction.” Huh?? Then Ziliak and McCloskey say “cigarette smoking may be addictive.” Umm, maybe. I guess the jury is still out on that one . . . .

OK, regarding “rational addiction,” I’m sure some economists will bite my head off for mocking the concept, so let me just say that presumably different people are addicted in different ways. Some people are definitely addicted in the real sense that they want to quit but they can’t; perhaps others are addicted rationally (whatever that means). I could imagine fitting some sort of mixture model or varying-parameter model. I could imagine some sort of rational addiction model as a null hypothesis or straw man. I can’t imagine it as a serious model of smoking behavior.

7. Hoover and Siegler must be correct that economists overwhelmingly understand that statistical and practical significance are not the same thing. But Ziliak and McCloskey are undoubtedly also correct that most economists (and others) confuse these all the time. They have the following quote from a paper by Angrist: “The alternative tests are not significantly different in five out of nine comparisons (p<0.02), but the joint test of coefficient equality for the alternative estimates of theta.t leads to rejection of the null hypothesis of equality.” This indeed does not look like good statistics. Similar issues arise in the specific examples. For instance, Ziliak and McCloskey describe how Becker, Grossman, and Murphy summarize their results in terms of t-ratios of 5.06, 5.54, etc., which indeed miss the point a bit. But Hoover and Siegler point out that Becker et al. also present coefficient estimates and interpret them on relevant scales. So they make some mistakes but present some things reasonably.

8. People definitely don’t understand that the difference between significant and not significant is not itself statistically significant.
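
To illustrate point 8 with made-up numbers: an estimate of 25 with standard error 10 is conventionally “significant” (z = 2.5), an estimate of 10 with standard error 10 is not (z = 1.0), yet the difference between the two, 15 with a standard error of about 14, is itself nowhere near statistically significant.

    import math

    est1, se1 = 25.0, 10.0   # z = 2.5, conventionally "significant"
    est2, se2 = 10.0, 10.0   # z = 1.0, "not significant"

    diff = est1 - est2
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)   # SEs combine in quadrature for independent estimates
    print(diff / se_diff)                      # about 1.06, so the difference is not significant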

9. Finally, what does this say about the practice of statistics (or econometrics)? Does it matter at all, or should we just be amused by the gradually escalating verbal fireworks of the McCloskey/Ziliak/Hoover/Siegler exchange? In answer to Scott’s original questions, I do think that statistical significance is often misinterpreted, but I agree with Hoover and Siegler’s attitude that statistical significance tells you about the uncertainty of your inferences. The biggest problem I see in all this discussion is the restriction to simple methods such as least squares. When uncertainty is an issue, I think you can gain a lot from Bayesian inference and also from expanding models to include treatment interactions.

P.S. See here for more.

After I posted this discussion of articles by McCloskey, Ziliak, Hoover, and Siegler, I received several interesting comments, which I’ll address below. The main point I want to make is that the underlying problem–inference for small effects–is hard, and this is what drives much of the struggle with statistical significance. See here for more discussion of this point.

Statisticians and economists not talking to each other

Scott Cunningham wrote, surprised that I’d not heard of these papers before:

I wasn’t expecting anything like what you wrote. I live in a bubble, and just assumed you were familiar with the papers, because in grad school, whenever I presented results and said something was significant (meaning statistically significant), I would *always* get someone else responding, “but is it _economically_ significant” meaning, at minimum, is the result basically a very precisely measured no effect? The McCloskey/Ziliak stuff was constantly being thrown at you by the less quantitatively inclined people (that set is growing smaller all the time), and I forgot for a moment that those papers probably didn’t generate much interest outside economics.

I live in a bubble too, just a different bubble than Scott’s. He and others might be interested in this article by Dave Krantz on the null hypothesis testing controversy in psychology. Dave begins his article with:

This article began as a review of a recent book, What If There Were No Significance Tests? . . . The book was edited and written by psychologists, and its title was well designed to be shocking to most psychologists. The difficulty in reviewing it for [statisticians] is that the issue debated may seem rather trivial to many statisticians. The very existence of two divergent groups of experts, one group who view this issue as vitally important and one who might regard it as trivial, seemed to me an important aspect of modern statistical practice.

As noted above, I don’t think the issue is trivial, but it is true that I can’t imagine an article such as McCloskey and Ziliak’s appearing in a statistical journal.

Rational addiction

Scott also writes,

BTW, the rational addiction literature is a reference to Gary Becker and Kevin Murphy’s research program that applies price theory to seemingly “non-market phenomena,” such as addiction. Rational choice would seem to break down as a useful methodology when applied to something like addiction. Becker and Murphy have a seminal paper on this from 1988. It’s been an influential paper in the area of health economics, as numerous papers have followed it by estimating various price elasticities of demand, as well as testing the more general theory.

My reply to this: Yeah, I figured as much. It’s probably a great theory. But, ya know what? If Becker and Murphy want to get credit for being bold, transgressive, counterintuitive, etc etc., the flip side is that they have to expect outsiders like me to think their theory is pretty silly. As I noted in my previous entry, there’s certainly rationality within the context of addiction (e.g., wanting to get a good price on cigarettes), but “rational addiction” seems to miss the point. Hey, I’m sure I’m missing the key issue here, but, again, it’s my privilege as a “civilian” to take what seems a more commonsensical position here and leave the counterintuitive pyrotechnics to the professional economists.

The paradigmatic example in economics is program evaluation?

Mark Thoma “disagreed mildly” with my claim that the null hypothesis of zero coefficient is essentially always false. Mark wrote:

I don’t view the “paradigmatic example in economics” to be program evaluation. We do some of that, but much of what econometricians do is test the validity of alternative theories and in those contexts the hypothesis of a zero coefficient can make sense. For example, New Classical models imply that expected changes in the money supply should not impact real variables. Thus, a test of a zero coefficient on expected money in an equation with real activity as the dependent variable is a test of the validity of the New Classical model’s prediction. These tests require sharp distinctions between models, i.e., to find variables that can impact other variables in one theory but not another, and that’s something we try hard to find, but when such sharp distinctions exist I believe classical hypothesis tests have something useful to contribute.

Hmmm . . . . I’ll certainly defer to Mark on what is or is not the paradigmatic example in economics. I can believe that theory testing is more central. I’ll also agree that important theories do have certain coefficients set to zero. I doubt, however, that in actual economic data, such coefficients really would be zero (or, to be more precise, that coefficient estimates would asymptote to zero as sample sizes increase). To wander completely out of my zone of competence and comment on Mark’s money supply example: I’m assuming this is somewhat of an equilibrium theory, and short-term fluctuations in expected money supply could affect individual actors in the economy, which could then create short-term outcomes, which would show up in the data in some way (and then maybe, in good “normal science” fashion, be explained in a reasonable way to preserve the basic New Classical model). What I’m saying is: in the statistics, I don’t think you’d really be seeing zero, and I don’t think the Type 1 / Type 2 error framework is relevant.

Getting better? And a digression on generic seminar questions

Justin Wolfers writes that “the meaningless statements of statistical rather than economic significance are declining.” Yeah, I think things must be getting better. Many years ago, Gary told me that his generic question to ask during seminars was, “What are your standard errors?” Apparently in poli sci, that used to stop most speakers in their tracks. We’ve now become much more sophisticated–in a good way, I think. (By the way, it’s good to have a few of these generic questions stored up, in case you fall asleep or weren’t paying attention during the talk. My generic questions include: “Could you simulate some data from your fitted model and see if they look like your observed data?” and “How many data points would you have to remove for your effect estimate to go away?”)
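
Here’s a minimal sketch, with entirely made-up data, of what that fake-data question amounts to for an ordinary least-squares regression: simulate replicated datasets from the fitted model and check whether a simple summary of the replications looks like the same summary of the observed data.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=200)
    y = 1.0 + 0.5 * x + rng.standard_normal(200)    # made-up "observed" data

    # Fit y = a + b * x by least squares.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_sd = np.std(y - X @ coef, ddof=2)

    # Simulate replicated datasets from the fitted model and compare a simple
    # summary (here, the proportion of outcomes below 2) with the observed value.
    obs_stat = np.mean(y < 2)
    rep_stats = [np.mean(X @ coef + resid_sd * rng.standard_normal(len(y)) < 2)
                 for _ in range(1_000)]
    print(obs_stat, float(np.mean(rep_stats)))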

Justin uses a lot of bold type in his blog entries. What’s with that? Maybe a good idea? I use bold for section headings, but he uses them all over the place.

Sports examples

Also, since I’m responding to Justin, let me comment on his use of sports as examples in his classes. I do this too–heck, I even wrote a paper on golf putting, and I’ve never even played macro-golf–but, as people have noted on occasion, you have to be careful with such examples because they exclude many people who aren’t interested in the topic. (And, unlike examples in biology, or economics, or political science, it’s harder to make the case that it’s good for the students’ general education to become more familiar with the statistics of basketball or whatever.) So: keep the sports examples, but be inclusive.

13 thoughts on “McCloskey et al. on significance testing in economics”

  1. "Statistically insignificant" (often erroneously asserted since a specification search leads to the assertion) and "practical significance" are quite different but often the first is used to justify the latter as a conclusion. Replicability requires retesting with a different sample to see if the "statistically significant" conclusion holds up; but, the replication is rarely done. The employment of Occam's Razor to practically insignificant but statistically significant results (and thus to those which are prima facie replication -suspect) is in limited use.

  2. reposted to correct egregious typo:

    "Statistically significant" (often erroneously asserted since a specification search leads to the assertion) and "practical significance" are quite different but often the first is used to justify the latter as a conclusion. Replicability requires retesting with a different sample to see if the "statistically significant" conclusion holds up; but, the replication is rarely done. The employment of Occam's Razor to practically insignificant but statistically significant results (and thus to those which are prima facie replication -suspect) is in limited use.

  3. I've also noticed that fewer people harp on "economic vs. statistical significance" in seminars. In fact, I never hear it. But I wonder if it's because in applied micro, the concerns are usually of a certain kind having to do with endogeneity of regressors, first-stage F-tests for instrumental variables, and things like that. I don't know if the problem that McCloskey worried about has gone away, so much as it is that nowadays, the Angrist school of thought dominates applied micro empirical talks, and your main concern is first and foremost to establish a causal effect. Having statistical significance does not mean you have done that, for instance. I wonder, in other words, if Justin isn't describing the more general shift in applied micro empirical work, which is more closely focused on treatment evaluation, now more than ever.

  4. Andrew, the link to the Krantz paper is broken. Could you fix that please?

    I also had the misfortune to purchase McCloskey et al.'s book on the "Cult of Significance Testing." I found it to be a mishmash of hysteria and misquoting, mixed with a lot of confusion.

  5. Regarding point 1, why would you rather have a report of a significance test than not? Given that you think true null effects probably don't exist, it seems like the significance test can't be interpreted except as a monotonic transform of effect size. The problem is the transform depends on sample size, which makes it difficult to interpret the p value.

    Krantz's statement also confuses me, since I know of exchanges on this subject in the stat literature (such as Berger and Delampady, 1987, in Statistical Science, and responses). Berger and Delampady make the strong claim that p values should be abandoned, inspiring quite an interesting discussion. The authors in that exchange didn't seem to regard the issue as trivial.

    As a psychologist, I believe that reliance on p values slows good model development. When you can find a significant p value for anything just by increasing sample size (because we know researchers often run experiments until an effect becomes significant…), you end up with all kinds of researchers asking "what about this effect? What about that effect?" When you have to include everything, including the kitchen sink, in your model, things are going to get crazy…

  6. "I do think that statistical significance is often misinterpreted but I agree with Hoover and Siegler's attitude that statistical significance tells you about your uncertainty of your inferences. The biggest problem I see in all this discussion is the restriction to simple methods such as least squares. When uncertainty is an issue, I think you can gain a lot from Bayesian inference"

    I think a key point is that, to avoid being misled by the data, it's important to look at many aspects of it, not just one statistic, or even a few, if possible.

    Statistical significance, a p or t value, is just one view of the data, and it can often be misleading. At the very least a confidence interval (or several) should always be looked at. With that you can never ridiculously mistake statistical significance for economic significance.

    In finance, confidence intervals are rarely presented along with p or t values. I will often construct them myself quickly in my head with a widely applicable approximate 95% confidence interval formula: BetaHat ± (2/TValue)*BetaHat (a quick sketch of the arithmetic appears at the end of this comment).

    A great thing about Bayesian work is that you often get histograms. Histograms and percentiles are incredible for really understanding data and not being misled by it. I always tell my business statistics students to get a histogram and/or percentiles if you really want to avoid being misled, as you can be by just one or two, or even a few, statistics.
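
    A rough sketch of the arithmetic behind that rule of thumb, with made-up numbers: since the t-value is BetaHat divided by its standard error, the standard error is BetaHat/TValue, so an approximate 95% interval is BetaHat ± 2*(BetaHat/TValue).

        # Back-of-the-envelope 95% interval from a coefficient and its t-value
        # (made-up numbers, purely illustrative).
        beta_hat = 0.8
        t_value = 4.0

        se = beta_hat / t_value                        # since t = beta_hat / se
        ci = (beta_hat - 2 * se, beta_hat + 2 * se)    # i.e., beta_hat ± (2 / t) * beta_hat
        print(ci)                                      # roughly (0.4, 1.2)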

  7. Richard

    You might wish to look at Stat Sci 2004, "A Conversation with James O. Berger" – especially the top of page 216.

    Anonymous

  8. Anonymous,

    I looked at that article. On page 216 Berger writes:

    That’s a tough one. I think most statisticians will agree that p-values are widely misused. At the moment, however, our profession does not agree on an alternative to the use of p-values for testing and, until we do, it is going to be hard to make progress on eliminating the misuses.

    He doesn't give a specific solution, but I think confidence intervals are a good suggestion. Putting at least a 95% confidence interval next to each p-value would take up little space; it's just two numbers with a comma between them, but it would make ridiculous economic significance mistakes virtually impossible. It would really add a lot of important understanding and prevent a lot of confusion.

  9. Richard:

    I did say the "top" of page 216 …

    Good classical statisticians do realize what the p-value is answering, and become adept at understanding how to intuitively address the "real question"

    but it surely needs to be understood in the context of the whole interview.

    Anonymous

