William Perkins, Mark Tygert, and Rachel Ward write:

If a discrete probability distribution in a model being tested for goodness-of-fit is not close to uniform, then forming the Pearson χ2 statistic can involve division by nearly zero. This often leads to serious trouble in practice — even in the absence of round-off errors . . .

The problem is not merely that the chi-squared *statistic* doesn’t have the advertised chi-squared *distribution*—a reference distribution can always be computed via simulation, either using the posterior predictive distribution or by conditioning on a point estimate of the cell expectations and then making a degrees-of-freedom sort of adjustment.

Rather, the problem is that, when there are lots of cells with near-zero expectation, the chi-squared test is mostly noise.

And this is not merely a theoretical problem. It comes up in real examples.

Here’s one, taken from the classic 1992 genetics paper of Guo and Thompson:

And here are the expected frequencies from the Guo and Thompson model:

The p-value of the chi-squared test is 0.693. That is, nothing going on.

But it turns out that if you do an equally-weighted mean square test (rather than chi-square, which weights each cell’s squared discrepancy inversely proportional to its expected count), you get a p-value of 0.039. (Perkins, Tygert, and Ward compute the p-value via simulation.) Rejection!

This is no trick. All those zeroes and near-zeroes in the data give you a chi-squared test that is so noisy as to be useless. If people really are going around saying their models fit in such situations, it could be causing real problems.

Here’s what’s going on. The following graph shows the discrepancies, (Observed – Expected)/sqrt(Expected), which are squared and summed to form the chi-squared statistic:

Only one of the cells is really bad—that point on the lower-right—but it has a high expected value (it’s one of the largest cells in the table), and when you take the equally-weighted mean square (which is equivalent to weighting the contributions to the chi-square in proportion to the expected count), you get a big total value. In the chi-squared statistic, all that noise in the empty cells is diluting the signal.
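To make the contrast between the two statistics concrete, here is a minimal sketch, in Python, of both statistics and a simulation-based p-value. The cell probabilities below are made up for illustration (two big cells plus many near-empty ones), not the Guo and Thompson data:

```python
import numpy as np

rng = np.random.default_rng(0)

def chi_square(obs, exp):
    # Pearson chi-squared: sum of squared relative discrepancies,
    # i.e., each cell's squared discrepancy divided by its expectation
    return np.sum((obs - exp) ** 2 / exp)

def mean_square(obs, exp):
    # equally weighted mean square: every cell contributes with the same weight
    return np.mean((obs - exp) ** 2)

def mc_pvalue(stat, obs, p, n_sims=10_000):
    # Monte Carlo p-value under a multinomial null with known cell probabilities p
    n = obs.sum()
    exp = n * p
    t_obs = stat(obs, exp)
    sims = rng.multinomial(n, p, size=n_sims)
    t_sim = np.array([stat(s, exp) for s in sims])
    return (t_sim >= t_obs).mean()

# toy table: two big cells plus many near-empty ones
p = np.array([0.45, 0.45] + [0.01] * 10)
obs = rng.multinomial(500, p)
print("chi-square p:", mc_pvalue(chi_square, obs, p))
print("mean-square p:", mc_pvalue(mean_square, obs, p))
```

The near-empty cells contribute pure noise to `chi_square` (their tiny expectations sit in the denominator) while contributing almost nothing to `mean_square`.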

Those frequency tables look an awful lot like … monkey cages.

Is this problem due to the combination of large and small expected values, or just small expected values?

I often have tables with lots of small values. It seems to me that that would get rid of the problem you show in the bottom graph, as no cell will have such large expected values.

What about this problem w/Fisher’s exact test. Is the scenario as equally troublesome?

I believe that the Guo and Thompson paper referenced is arguing for an exact test (or at least a Monte Carlo approximation thereof). As they are dealing with what is essentially a folded contingency table, the test is slightly different from Fisher’s, but the idea is the same.

Yes, the issue here is not so-called exact tests (that is, the exact distribution of the test statistic under a (perhaps unreasonable) sampling distribution) but rather the choice of test statistic.

What about G statistics? I’ve been told they have much better properties than chi-square for small expected values. Would that solve the problem?

G-statistics (all statistics which, like Pearson’s chi-square, asymptotically follow a chi-square distribution) overweight small bins, and exhibit similar problems from this. For instance, G-statistics also misreport significance on the example from Guo and Thompson (although not quite as badly as chi-square). We give several more examples in our paper.

G statistics, which are generalized log-likelihood-ratio tests based on the multinomial distribution without using the normal-distribution approximation, can do much, much better than Pearson’s chi^2 test for many applications. See my own experience here: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html and the associated 1991 paper. Whether it would help in a larger table is not clear.

The real strength of the G statistics is that they give pretty good results without the cost of Fisher’s exact test. For text retrieval and recommendations, they are able to provide superb performance, not so much as a test of significance but more as a prioritization score.
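For reference, the G statistic being discussed is the log-likelihood-ratio analogue of Pearson’s chi-square. A minimal sketch, using the usual convention that empty observed cells contribute nothing:

```python
import numpy as np

def g_statistic(obs, exp):
    # G = 2 * sum over cells of O * ln(O / E); cells with O = 0 contribute zero
    obs = np.asarray(obs, dtype=float)
    exp = np.asarray(exp, dtype=float)
    mask = obs > 0
    return 2.0 * np.sum(obs[mask] * np.log(obs[mask] / exp[mask]))
```

Like Pearson’s statistic, G is asymptotically chi-squared under the null, which is why the two behave so similarly in the examples above.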

The problem pointed out in the blog posting arises only in tests with a nontrivial number of degrees of freedom, if I understand correctly. For the relatively large table of the blog posting, the P-value for G is .600 — not so different from the P-value for chi-squared, which is .693. The P-value for the equally weighted mean square is much smaller — .039. So G would seem to have the same problem as chi-squared (which is not surprising, as both have the same limiting distribution when the number of draws is large and the null hypothesis is true). But this is only an issue when there are a nontrivial number of degrees of freedom. Testing for independence in a 2×2 contingency-table/cross-tabulation clearly calls for special methods (possibly including confidence intervals or Bayesian inference), though G can be good for 2×2 tables.

It seems that one exception to this is when you have a good reason to expect the small cells to have the largest effects, as can be the case in genetics, where the variants with the largest effects should be rare.

Reference for equally-weighted mean square test?

Sorry—I forgot to include the link. I’ve added it now (see first line of the post).

To the Fisher Exact comment by John …

The premise of Fisher’s exact test is comparing the observed count in a cell with the count predicted by assuming independence, that is, the product of the marginals. If the horizontal or vertical marginal is zero, then nothing meaningful can be said, but, then, the cell must necessarily also be zero. This means nothing has been observed about the factor in consideration, so the test has no statement regarding it.
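In code, the independence prediction from the marginals looks like the following (the 2×2 table here is made up for illustration):

```python
import numpy as np

def expected_under_independence(table):
    # cell (i, j) expectation = (row i total) * (column j total) / grand total
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    return row * col / table.sum()

# if a row or column marginal is zero, every expectation in that
# row or column is zero too, so the cell carries no information
print(expected_under_independence([[10, 20], [30, 40]]))
```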

I looked at the scatter plot and believe the null should be true. I think it is a desirable feature that no single cell dominates the test statistic.

Besides, I think given this sample size, a p-value of 0.039 tells me the real effect is small. I would say it’s very close to zero, rather than emphasizing that its super-narrow 95% CI does not cover zero.

It is kind of hard to say. Even just the discrepancy in the bin for (j=4, k=1) is enormous.

Wouldn’t it be less sensationalist to say that Pearson’s chi-square is less sensitive than another proposed test statistic to some kinds of departure from the null hypothesis, but more sensitive to others? You pays your money & you takes your choice.

That is undeniably true, though the posted version is more fun, no? The real issue seems to be that most classical statistics for discrete goodness-of-fit use relative discrepancies, whereas the unweighted mean-square uses absolute discrepancies. This makes for a very big difference in performance, much like the difference between the Kolmogorov-Smirnov and Anderson-Darling statistics. This difference seems to be much bigger in practice than the difference in performance between chi-squared and the maximum relative discrepancy, for example (chi-squared is the sum of the squares of the relative discrepancies).

Also, it’s not like people are “paying their money and taking their choice.” Chi-square is typically just used as a default. If it generally performs poorly when there are a lot of noisy cells with expectations near zero, that could be good to know.

I’d have thought the default method would be to collapse categories until each cell had a decent expected value & calculate a p-value from the asymptotic distribution of Pearson’s chi-square. The motivation may usually be to make the approximation good, but doing it in this case will have the same effect of lowering the p-value – by reducing the number of comparisons in the omnibus test.

If someone’s going to keep all the categories and take the trouble of getting a computer to simulate the distribution of the test statistic under the null hypothesis then it’s likely that they’re *interested* in what’s going on in those low-expectation cells. If so, the equally-weighted mean square won’t be of much use to them & Pearson’s chi-square will be. Of course they’ll see reduced sensitivity in the high-expectation cells; but this is just a case of, rather than the test performing poorly, not being able to have your cake & eat it.

I’m not rubbishing the equally-weighted mean square, but which test statistic to choose depends on what you’re asking it to do for you, not on whether a lot of cells have low expectations or not.

Note that rebinning so that every bin has at least five draws is not sufficient to lower the P-value all the way down to the level of the equally weighted mean square — in general the rebinning has to uniformize the expected values; uniformizing can require a great deal of rebinning, reducing the power of the test.

Regarding implementation, note that coding up a Monte-Carlo simulation to compute the P-value is totally trivial. It is rebinning that requires careful extra work from the user. And, of course, rebinning can improperly manipulate the power of the test.

As for defaults, an appealing option is to use both the equally weighted mean square and a classical statistic such as the likelihood ratio or chi-squared. For distributions whose cumulative distribution functions are continuous, the Kolmogorov-Smirnov, Cramer-von-Mises, or Kuiper statistics are the defaults. These are analogous to the equally weighted mean square. The continuous analogue of chi-squared is the Anderson-Darling statistic (or maybe the Renyi statistic), which is very useful for special circumstances, but is generally considered to concentrate its power in somewhat undesirable directions for most practical applications. It is not clear that the defaults for discrete distributions should be so different from the accepted defaults for the continuous case, now that computers are widely available.

I was talking about what people actually do by default. For most, Monte Carlo simulation won’t be totally trivial until it’s a button on the Minitab or SPSS toolbar. So they bin the data & thereby mitigate the effect of cells with low expectations – whether this is a good thing or not.

As to what a default method *should* be, it’s a tricky question & your answer (equally-weighted mean square and likelihood ratio tests) is perhaps a little glib; performing both tests at 5%, we’d reject the null more than 5% of the time. Any particular sample can be considered extreme in some way or another (D.R. Cox, I think), so it doesn’t cut it to simply present an example data-set & say: Test A detects a discrepancy, test B doesn’t, so test A is better. Test B will necessarily detect discrepancies that test A doesn’t. This is why I don’t like the talk of chi-square failing or misreporting significance levels.

To compare the power of different tests you need to define an alternative hypothesis, which is what you don’t really have for a default goodness-of-fit significance test. A practical way is to compare power using a range of alternative hypotheses that purport to be widely relevant in applications. A principled way is to construct a very general alternative encompassing various types of departure from the null & ordering samples by the likelihood ratio – when the alternative is that each cell has its own multinomial probability, this is the G-statistic or likelihood ratio test. I’ve seen the practical way used to compare G.O.F. tests for continuous distributions, but not for discrete distributions or contingency tables. The logic of the L.R.T. is appealing when you’re ‘neutral’ about sample ordering – samples that have a higher ratio of probability under the best-fit alternative to probability under the best-fit null are more extreme. If on the other hand you’re more interested in the meaty part of the data’s p.m.f. than in the tail, your equally-weighted mean square looks like the way to go.

The following is a supplementary example that illustrates just what kind of rebinning the classical statistics sometimes require in order to work well (many further classes of practical examples are available at my website). The usual rules for rebinning are insufficient.

We would like to gauge whether the results of two polls from Chapter 4 of Andersen (1990) can reasonably be assumed to be the same up to expected random fluctuations:

     Poll1  Poll2
  ---------------
A     416    268
B      45     22
C     338    160
E      13      6
F     131     66
K      18     10
M      47     16
Q      20      8
V     129     92
Y      22      9
Z      76     32
  ---------------
    1255    689

The following are the exact P-values computed via four million Monte-Carlo simulations for testing whether the proportions of polled voters can reasonably be assumed to be the same for both polls (fixing the number of polled voters in each poll during the simulations):

chi-square: .0868

log-likelihood-ratio (G): .0906

Freeman-Tukey: .0959

negative log-likelihood: .0905

equally weighted mean square: .00838

The equally weighted mean square indicates that the polls gave somewhat different results, whereas the classical statistics are equivocal. The P-value for the equally weighted mean square is over an order of magnitude smaller.
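A sketch of the kind of simulation described above: each poll is drawn as a multinomial with the pooled proportions, fixing each poll’s total. This is an approximation for illustration; it will not reproduce the quoted P-values exactly (the conditioning used for the exact computation may differ, and four million simulations were used there):

```python
import numpy as np

rng = np.random.default_rng(1)

poll1 = np.array([416, 45, 338, 13, 131, 18, 47, 20, 129, 22, 76])
poll2 = np.array([268, 22, 160, 6, 66, 10, 16, 8, 92, 9, 32])

def mean_square(obs, exp):
    return np.mean((obs - exp) ** 2)

def two_sample_stat(a, b, stat):
    # compare each poll with its expectation under the pooled proportions
    p = (a + b) / (a.sum() + b.sum())
    return stat(a, a.sum() * p) + stat(b, b.sum() * p)

def mc_pvalue(a, b, stat, n_sims=10_000):
    # simulate the null of equal proportions, fixing each poll's total
    p = (a + b) / (a.sum() + b.sum())
    t_obs = two_sample_stat(a, b, stat)
    hits = 0
    for _ in range(n_sims):
        sa = rng.multinomial(a.sum(), p)
        sb = rng.multinomial(b.sum(), p)
        hits += two_sample_stat(sa, sb, stat) >= t_obs
    return hits / n_sims

print("mean-square p:", mc_pvalue(poll1, poll2, mean_square, n_sims=2000))
```

Swapping `mean_square` for a chi-square or G statistic in `two_sample_stat` gives the corresponding classical P-values.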