# Small numbers in chi-square and G-tests

Chi-square and G-tests of goodness-of-fit or independence give inaccurate results when the expected numbers are small. For example, if you observe 11 people with torn anterior cruciate ligaments, and 9 have torn their right ACL and 2 have torn their left ACL, you would compare the observed ratio to an expected 1:1 ratio to see if there's evidence that people are more likely to tear one ACL than the other. The exact binomial test gives a P-value of 0.065, the chi-square test of goodness-of-fit gives a P-value of 0.035, and the G-test of goodness-of-fit gives a P-value of 0.028. If you analyzed the data using the chi-square or G-test, you would conclude that people tear their right ACL significantly more than their left ACL; if you used the exact binomial test, which is more accurate, the evidence would not be quite strong enough to reject the null hypothesis.
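The three P-values for the ACL example can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function names are my own, and it relies on the fact that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(x/2))`, so it only covers the two-category, 1:1 case.

```python
from math import comb, erfc, log, sqrt

def exact_binomial_p(k, n):
    """Two-tailed exact binomial P-value for k out of n under a 1:1 null."""
    # Under a 1:1 null the distribution is symmetric, so every outcome at
    # least as far from n/2 as the observed count is at least as extreme.
    dev = abs(k - n / 2)
    extreme = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= dev)
    return extreme / 2 ** n

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

def chi_square_p(k, n):
    """Chi-square goodness-of-fit P-value for k vs. n - k under a 1:1 null."""
    e = n / 2
    stat = (k - e) ** 2 / e + ((n - k) - e) ** 2 / e
    return chi2_sf_1df(stat)

def g_test_p(k, n):
    """G-test goodness-of-fit P-value for k vs. n - k under a 1:1 null."""
    e = n / 2
    # A category with an observed count of 0 contributes nothing to G.
    stat = 2 * sum(o * log(o / e) for o in (k, n - k) if o > 0)
    return chi2_sf_1df(stat)

print(round(exact_binomial_p(9, 11), 3))  # 0.065
print(round(chi_square_p(9, 11), 3))      # 0.035
print(round(g_test_p(9, 11), 3))          # 0.028
```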

When the sample sizes are too small, alternatives to the chi-square test or G-test are recommended. However, how small is "too small"? The conventional rule of thumb is that if all of the expected numbers are greater than 5, it's acceptable to use the chi-square or G-test; if an expected number is less than 5, you should use an alternative, such as an exact test or a randomization test for goodness-of-fit, or a Fisher's exact test or randomization test of independence.

This rule of thumb is left over from the olden days, when the calculations necessary for an exact test were exceedingly tedious and error-prone, and a randomization test would have required flipping actual coins or rolling actual dice thousands of times. Now that we have these new-fangled gadgets called computers, it's time to retire the "expected less than 5" rule. But what new rule should you use?

Here is a graph of relative P-values versus sample size. For each sample size, a pair of numbers was found that would give a P-value for the exact binomial test (null hypothesis, 1:1 ratio) that was as close as possible to P=0.05 without going under it. For example, with a sample size of 11, the numbers 9 and 2 give a P-value of 0.065. The chi-square test was then done on these numbers, and the chi-square P-value was divided by the exact binomial P-value. For 9 and 2, the chi-square P-value is 0.035, so the ratio is 0.035/0.065 = 0.54. In other words, the chi-square test gives a P-value that is only 54 percent as large as the more accurate exact binomial test. The G-test gives almost the same results as the chi-square test.

Figure: P-values of chi-square and G-tests, as a proportion of the P-value from the exact binomial test.

Plotting these relative P-values vs. sample size (chi-square in black, G-test in green), it is clear that the chi-square and G-tests give P-values that are too low, even for sample sizes in the hundreds. This means that if you use a chi-square or G-test of goodness-of-fit and the P-value is just barely significant, you will reject the null hypothesis, even though the more accurate P-value of the exact binomial test would be above 0.05. The results are similar for 2×2 tests of independence; the chi-square and G-tests give P-values that are considerably lower than that of the more accurate Fisher's exact test.
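The procedure behind the graph can be sketched roughly as follows. This is my own reconstruction from the description, not the code used to draw the figure: for each sample size it scans from the most lopsided split toward an even one, stops at the first split whose exact binomial P-value is at least 0.05, and divides the chi-square P-value by the exact one. The function names are hypothetical.

```python
from math import comb, erfc, sqrt

def exact_binomial_p(k, n):
    """Two-tailed exact binomial P-value for k out of n under a 1:1 null."""
    dev = abs(k - n / 2)
    extreme = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= dev)
    return extreme / 2 ** n

def chi_square_p(k, n):
    """Chi-square goodness-of-fit P-value (1 d.f.) for k vs. n - k, 1:1 null."""
    e = n / 2
    stat = (k - e) ** 2 / e + ((n - k) - e) ** 2 / e
    return erfc(sqrt(stat / 2))

def borderline_split(n, alpha=0.05):
    """Most lopsided split of n whose exact P-value is still >= alpha."""
    # The exact P-value grows as k moves from n toward n/2, so the first
    # k that reaches alpha is the one closest to alpha without going under.
    for k in range(n, (n + 1) // 2 - 1, -1):
        p = exact_binomial_p(k, n)
        if p >= alpha:
            return k, p

def relative_p(n, alpha=0.05):
    """Chi-square P-value divided by the exact binomial P-value."""
    k, p_exact = borderline_split(n, alpha)
    return chi_square_p(k, n) / p_exact

for n in (11, 50, 100, 500):
    k, p_exact = borderline_split(n)
    print(n, k, n - k, round(p_exact, 3), round(relative_p(n), 2))
```

For a sample size of 11 this picks 9 and 2, as in the text. The ratio it prints is computed from unrounded P-values, so it can differ slightly from the 0.54 obtained above by dividing the rounded values.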

### Yates' and Williams' corrections

One solution to this problem is to use Yates' correction for continuity, sometimes just known as the continuity correction. To do this, you subtract 0.5 from each observed value that is greater than the expected, add 0.5 to each observed value that is less than the expected, then do the chi-square or G-test. This only applies to tests with one degree of freedom: goodness-of-fit tests with only two categories, and 2×2 tests of independence. It works quite well for goodness-of-fit, yielding P-values that are quite close to those of the exact binomial. For tests of independence, Yates' correction yields P-values that are too high.
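As a rough illustration, here is Yates' correction applied to the 9 vs. 2 ACL data from the top of the page. This is a minimal sketch following the description above (move each observed value 0.5 toward its expected value, then compute chi-square); the function name is my own, and the P-value formula is only valid for one degree of freedom.

```python
from math import erfc, sqrt

def yates_chi_square_p(observed, expected):
    """Chi-square P-value with Yates' correction (1 degree of freedom only)."""
    stat = 0.0
    for o, e in zip(observed, expected):
        # Move each observed count 0.5 toward its expected value.
        if o > e:
            o -= 0.5
        elif o < e:
            o += 0.5
        stat += (o - e) ** 2 / e
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(stat / 2))

p = yates_chi_square_p([9, 2], [5.5, 5.5])
print(round(p, 3))
```

The corrected P-value comes out at about 0.07, much closer to the exact binomial value of 0.065 than the uncorrected 0.035.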

Another correction that is sometimes used is Williams' correction. For a goodness-of-fit test, Williams' correction is found by dividing the chi-square or G values by the following:

$$
q = 1 + \frac{a^2 - 1}{6nv}
$$

where a is the number of categories, n is the total sample size, and v is the number of degrees of freedom. For a test of independence with R rows and C columns, Williams' correction is found by dividing the chi-square or G value by the following:

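$$
q = 1 + \frac{\left(n\left[\frac{1}{\text{row 1 total}} + \dots + \frac{1}{\text{row } R \text{ total}}\right] - 1\right)\left(n\left[\frac{1}{\text{column 1 total}} + \dots + \frac{1}{\text{column } C \text{ total}}\right] - 1\right)}{6n(R-1)(C-1)}
$$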

Unlike Yates' correction, Williams' correction can be applied to tests with more than one degree of freedom. For the numbers I've tried, it increases the P-value a little, but not enough to make it very much closer to the more accurate P-value provided by the exact binomial or Fisher's exact test.
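The two formulas for q translate directly into code. The sketch below uses hypothetical function names and assumes, for the goodness-of-fit case, that the degrees of freedom default to the number of categories minus one. It applies the correction to the G-test on the 9 vs. 2 ACL data, and computes q for a 2×2 table built from the pooled ankle-injury counts in the Pooling section (40 and 28 orthopedic, 6 and 8 non-orthopedic).

```python
from math import erfc, log, sqrt

def williams_q_gof(a, n, v=None):
    """Williams' q for goodness-of-fit: a categories, sample size n, v d.f."""
    if v is None:
        v = a - 1  # assumption: no parameters estimated from the data
    return 1 + (a ** 2 - 1) / (6 * n * v)

def williams_q_independence(table):
    """Williams' q for an R x C table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    R, C = len(row_totals), len(col_totals)
    row_term = n * sum(1 / t for t in row_totals) - 1
    col_term = n * sum(1 / t for t in col_totals) - 1
    return 1 + row_term * col_term / (6 * n * (R - 1) * (C - 1))

# G-test of goodness-of-fit for 9 vs. 2 against a 1:1 expectation.
observed, expected = [9, 2], [5.5, 5.5]
g = 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
q = williams_q_gof(a=2, n=11)

# With 1 degree of freedom the chi-square upper tail is erfc(sqrt(x / 2)).
p_uncorrected = erfc(sqrt(g / 2))
p_corrected = erfc(sqrt(g / q / 2))
print(round(q, 4), round(p_uncorrected, 3), round(p_corrected, 3))

print(round(williams_q_independence([[40, 28], [6, 8]]), 4))
```

For the ACL data, q is 1 + 3/66, and the corrected P-value rises only from about 0.028 to about 0.032, still well short of the exact binomial value of 0.065.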

Some software may apply the Yates' or Williams' correction automatically. When reporting your results, be sure to say whether or not you used one of these corrections.

### Pooling

When a variable has more than two categories, and some of them have small numbers, it often makes sense to pool some of the categories together. For example, let's say you want to compare the proportions of different kinds of ankle injuries in basketball players vs. volleyball players, and your numbers look like this:

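| | basketball | volleyball |
|---|---|---|
| sprains | 18 | 16 |
| breaks | 13 | 5 |
| torn ligaments | 9 | 7 |
| cuts | 3 | 5 |
| puncture wounds | 1 | 3 |
| infections | 2 | 0 |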

The numbers for cuts, puncture wounds, and infections are pretty small, and this will cause the P-value for your test of independence to be inaccurate. Having a large number of categories with small numbers will also decrease the power of your test to detect a significant difference; adding categories with small numbers can't increase the chi-square value or G-value very much, but it does increase the degrees of freedom. It would therefore make sense to pool some categories:

| | basketball | volleyball |
|---|---|---|
| sprains | 18 | 16 |
| breaks | 13 | 5 |
| torn ligaments | 9 | 7 |
| other injuries | 6 | 8 |

Depending on the question you're interested in, it might make sense to pool the data further:

| | basketball | volleyball |
|---|---|---|
| orthopedic injuries | 40 | 28 |
| non-orthopedic injuries | 6 | 8 |

It is important to make decisions about pooling before analyzing the data. In this case, you might have known, based on previous studies, that cuts, puncture wounds, and infections would be relatively rare and should be pooled. You could have decided before the study to pool all injuries for which the total was 10 or fewer, or you could have decided to pool all non-orthopedic injuries because they're just not biomechanically interesting.
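The effect of pooling on the ankle-injury data can be checked with a few lines of code. This is only a sketch: the helper names are my own, the column order (basketball first, volleyball second) is an assumption, and it reports the chi-square statistic, the degrees of freedom, and the smallest expected number rather than a P-value.

```python
def pool(counts, labels, new_label):
    """Merge the rows named in labels into a single row called new_label."""
    pooled = {k: v for k, v in counts.items() if k not in labels}
    pooled[new_label] = tuple(sum(col) for col in zip(*(counts[k] for k in labels)))
    return pooled

def chi_square_independence(counts):
    """Return (chi-square, degrees of freedom, smallest expected number)."""
    table = list(counts.values())
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat, min_expected = 0.0, float("inf")
    for row, rt in zip(table, row_totals):
        for obs, ct in zip(row, col_totals):
            exp = rt * ct / n
            min_expected = min(min_expected, exp)
            stat += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, min_expected

# Counts per injury type; assumed order is (basketball, volleyball).
injuries = {
    "sprains": (18, 16),
    "breaks": (13, 5),
    "torn ligaments": (9, 7),
    "cuts": (3, 5),
    "puncture wounds": (1, 3),
    "infections": (2, 0),
}
pooled = pool(injuries, ["cuts", "puncture wounds", "infections"], "other injuries")

for name, data in (("unpooled", injuries), ("pooled", pooled)):
    stat, df, min_exp = chi_square_independence(data)
    print(name, round(stat, 2), df, round(min_exp, 2))
```

Pooling the three rare categories drops the degrees of freedom from 5 to 3 and raises the smallest expected number from below 1 to above 5.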

### Recommendations

Goodness-of-fit with two categories: Use the exact binomial test for sample sizes of 1000 or less. Spreadsheets, web pages and SAS should have no problem doing the exact binomial test for sample sizes less than 1000, but they may not be able to handle the calculations for larger sample sizes. For sample sizes greater than 1000, use the chi-square or G-test of goodness-of-fit with Yates' correction (unless you are doing a replicated G-test of goodness-of-fit, in which case you must use the G-test without any continuity correction).

Goodness-of-fit with more than two categories: Use exact tests or randomization tests for sample sizes of 1000 or less. Try the exact tests first, but if the program you're using can't handle it, use randomization tests. Use the chi-square or G-test of goodness of fit for sample sizes greater than 1000. Don't use Williams' correction. If the total sample size is greater than 1000, but some expected numbers are small (less than 5), use randomization tests. Consider pooling rare categories.

2×2 test of independence: Use Fisher's exact test for sample sizes up to 1000. Use the chi-square or G-test of independence, with Yates' correction, for sample sizes greater than 1000.

Greater than 2×2 test of independence: Use either an exact test or randomization test for total sample sizes of 1000 or less. Try the exact test first, but if the program you're using can't handle it, use a randomization test. Use a chi-square or G-test of independence, without Williams' correction, for sample sizes greater than 1000.
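The four recommendations above can be condensed into a small lookup function. This is just a hypothetical helper that restates the rules on this page (the function name and the `design` labels are my own); it does not run any test itself.

```python
def recommend_test(design, n, min_expected=None):
    """Suggest a test following the recommendations on this page.

    design: "gof_2" (goodness-of-fit, two categories), "gof_many",
            "indep_2x2", or "indep_larger".
    n: total sample size.
    min_expected: smallest expected number, if known (used for "gof_many").
    """
    small = n <= 1000
    if design == "gof_2":
        if small:
            return "exact binomial test"
        return "chi-square or G-test of goodness-of-fit with Yates' correction"
    if design == "gof_many":
        if small:
            return "exact test, or randomization test if the exact test is not feasible"
        if min_expected is not None and min_expected < 5:
            return "randomization test"
        return "chi-square or G-test of goodness-of-fit, without Williams' correction"
    if design == "indep_2x2":
        if small:
            return "Fisher's exact test"
        return "chi-square or G-test of independence with Yates' correction"
    if design == "indep_larger":
        if small:
            return "exact test, or randomization test if the exact test is not feasible"
        return "chi-square or G-test of independence, without Williams' correction"
    raise ValueError(f"unknown design: {design!r}")

print(recommend_test("gof_2", 11))
print(recommend_test("gof_many", 5000, min_expected=2.5))
print(recommend_test("indep_2x2", 20000))
```

The replicated G-test exception mentioned for two-category goodness-of-fit is not encoded here; in that case the G-test is used without any continuity correction.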

Sokal and Rohlf, pp. 698-703, 729-730.

Zar, pp. 470, 504-505.