When to use it
Use the randomization test of goodness of fit when you have one nominal variable with three or more values (such as red vs. pink vs. white flowers), and the sample size is too small to do the chi-square test or the G-test of goodness-of-fit. An exact multinomial test would be just as good as a randomization test; I include the randomization test here because you'll find it difficult to do an exact multinomial test if you don't have access to SAS or another statistical package, and because it provides a simple introduction to the important topic of randomization-based tests (also known as Monte Carlo simulations).
The first step in doing a randomization test is to calculate the test statistic, in this case the chi-square statistic. This is a measure of how far the observed numbers are from the expected; a bigger deviation from the expected leads to a bigger chi-square statistic. When doing a chi-square test, you use the relationship that under the null hypothesis, the chi-square statistic approximately follows the mathematical chi-square distribution. With small expected numbers, this approximate relationship is not very accurate, which is why the randomization test is necessary.
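To make the test statistic concrete, here is a minimal sketch in Python (my own illustration, not part of the original handbook), using the snapdragon numbers that appear later on this page:

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic: sum of (observed - expected)^2 / expected
    over all categories; bigger deviations give a bigger statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed 5 red, 2 pink, 1 white; expected 2, 4, 2 under a 1:2:1 ratio.
print(chi_square_stat([5, 2, 1], [2, 4, 2]))  # 6.0
```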
Null hypothesis
The statistical null hypothesis is that the number of observations in each category is equal to that predicted by a biological theory, and the alternative hypothesis is that the observed numbers are different from the expected. The null hypothesis is usually an extrinsic hypothesis, one for which the expected proportions are determined before doing the experiment. An example is a 1:2:1 ratio in a genetic cross.
For the spreadsheet and web page described here, the null hypothesis must be extrinsic (such as an expected 1:2:1 ratio in a genetic cross), not intrinsic (such as the p²: 2pq: q² Hardy-Weinberg proportions of population genetics). If you want to do a randomization test with an intrinsic hypothesis, you will probably have to write a program yourself.
How the test works
Imagine you did a cross in which you expected a 1:2:1 ratio of red to pink to white snapdragon flowers. You got only 8 offspring, so you expect 2 red, 4 pink, and 2 white, if the null hypothesis is true; you actually got 5 red, 2 pink, and 1 white.
You calculate the chi-square statistic, which is 6.00. That is significant (P=0.0498), but you know that the chi-square test can be inaccurate with such small expected numbers. So you put one red ball, two pink balls, and one white ball in a hat. Without looking, you reach in, grab a ball, and write down what color it is. You put the ball back in and repeat this process until you've sampled 8 balls from a known 1:2:1 ratio. You calculate the chi-square statistic for these numbers, and see whether it's as big or bigger than your observed chi-square statistic of 6.00.
You repeat this process of randomly sampling 8 balls from a known 1:2:1 ratio, and see how often you get a chi-square of 6.00 or larger. If you get a chi-square that large more than 5 percent of the time, it tells you that if the null hypothesis is true, you'll get your observed result (or something even more deviant from the null) more than 5 percent of the time, so you can't reject the null hypothesis. If the randomization trials produce a chi-square of 6.00 or larger less than 5 percent of the time, you reject the null hypothesis. For these numbers (5 red, 2 pink, 1 white), I get P=0.0576 after 10,000 randomizations (done on a computer, not with a hat), which is not significant.
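The ball-drawing procedure above is easy to simulate in a few lines of code. The following is a hypothetical Python sketch of the same idea (it is not the spreadsheet or web page described below, and the function names are my own):

```python
import random

def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def randomization_test(observed, proportions, reps=10000):
    """Estimate P as the fraction of random samples, drawn from the null
    proportions, whose chi-square is at least as large as the observed one."""
    n = sum(observed)
    expected = [p * n for p in proportions]
    obs_chi = chi_square_stat(observed, expected)
    count = 0
    for _ in range(reps):
        sample = [0] * len(proportions)
        for _ in range(n):
            # draw one "ball" from the hat according to the null proportions
            r = random.random()
            cum = 0.0
            for i, p in enumerate(proportions):
                cum += p
                if r < cum:
                    sample[i] += 1
                    break
            else:
                sample[-1] += 1  # guard against floating-point round-off
        if chi_square_stat(sample, expected) >= obs_chi - 1e-9:
            count += 1
    return count / reps

# 5 red, 2 pink, 1 white, sampled from an expected 1:2:1 ratio;
# the estimate should come out near 0.058, varying slightly from run to run
print(randomization_test([5, 2, 1], [0.25, 0.50, 0.25]))
```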
Because it is taking a random sample of all possible combinations, the randomization test will give slightly different estimates of the P-value every time you run it. The more replicates you run, the more accurate your estimate of the P-value will be. You might want to start with a small number of replicates, such as 1,000, to be sure everything is working properly, then change the number of replicates to 100,000 or even 1,000,000 for your final result.
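You can quantify how the accuracy improves with more replicates: each replicate is a yes/no trial (chi-square at least as large as observed, or not), so the estimated P-value has roughly a binomial standard error. A quick sketch (my own, with an assumed P near 0.06):

```python
from math import sqrt

def monte_carlo_se(p, reps):
    """Rough standard error of a Monte Carlo P-value estimate,
    treating each replicate as a Bernoulli trial."""
    return sqrt(p * (1 - p) / reps)

# The standard error shrinks with the square root of the number of replicates:
print(round(monte_carlo_se(0.06, 10000), 4))    # 0.0024
print(round(monte_carlo_se(0.06, 1000000), 4))  # 0.0002
```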
This randomization test of goodness-of-fit is an example of an important technique in statistics. Sometimes you want to estimate the probability of getting an observed result if a null hypothesis is true (the P-value), and you have a test statistic that measures how far the observations are from the expected, but there is no theoretical relationship between the test statistic and the P-value. If you can simulate on a computer taking random samples from a population that fits the null hypothesis, you can see how often the observed value of the test statistic occurs, and therefore estimate the P-value. This technique is often called "Monte Carlo simulation," because it's like selecting a bunch of random numbers with a roulette wheel in the casino there. More elaborate Monte Carlo simulations usually require writing a computer program or using specialized mathematical software, so they are beyond the scope of this handbook, but you should be aware of the general concept.
The red-breasted nuthatch example from the chi-square and G-test of goodness-of-fit pages has some rather small expected numbers; under the null hypothesis, you'd only expect 7.8 foraging events in grand fir and 1.6 events in western larch. The chi-square and G-tests might therefore be a little inaccurate, so it would be better to use a randomization test. Using SAS to run one million replicate randomizations, the proportion of chi-square values for the randomly sampled data that were equal to or greater than the observed chi-square value (13.59) was only 0.0069; in other words, P=0.0069. This is somewhat higher than the result of the chi-square test (P=0.0035) or G-test (P=0.0043), but it doesn't change the conclusion: the foraging events are significantly different from a random distribution among the tree species.
Graphing the results
You plot the results of a randomization test of goodness-of-fit the same way you would a chi-square test of goodness-of-fit.
You have a choice of four goodness-of-fit tests: the exact test of goodness-of-fit (binomial or multinomial), the G-test of goodness-of-fit, the chi-square test of goodness-of-fit, or the randomization test. For small values of the expected numbers, the chi-square and G-tests are inaccurate, because the distributions of their test statistics do not fit the chi-square distribution very well.
The usual rule of thumb is that you should use the exact test or randomization test when the smallest expected value is less than 5, and the chi-square and G-tests are accurate enough for larger expected values. This rule of thumb dates from the olden days when statistics were done by hand, and the calculations for the exact test were very tedious and to be avoided if at all possible. Nowadays, computers make it just as easy to do the exact test or randomization test as the computationally simpler chi-square or G-test. I recommend that you use the exact test or randomization test when the total sample size is less than 1000. See the web page on small sample sizes for further discussion.
The exact test and randomization test should give you the same result, if you do enough replicates for the randomization test, so the choice between them is a matter of personal preference. The exact test sounds more "exact"; the randomization test may be easier to understand and explain. You can do the randomization test with a spreadsheet, web page, or simple homemade computer program; the exact test may require a sophisticated statistics program such as SAS.
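In fact, for a sample this small you don't need a sophisticated statistics program at all: the exact multinomial test can be done by brute force, enumerating every possible outcome and summing the probabilities of those at least as far from expected as the observed data. A hypothetical Python sketch for the three-category snapdragon example (my own code, written for exactly three categories; it would need generalizing for more):

```python
from math import factorial

def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def exact_multinomial_p(observed, proportions):
    """Exact P-value: sum the multinomial probability of every possible
    three-category outcome whose chi-square statistic is at least as
    large as the observed chi-square."""
    n = sum(observed)
    expected = [p * n for p in proportions]
    obs_chi = chi_square_stat(observed, expected)
    total = 0.0
    for r in range(n + 1):          # first category
        for q in range(n - r + 1):  # second category
            w = n - r - q           # third category
            outcome = (r, q, w)
            if chi_square_stat(outcome, expected) >= obs_chi - 1e-9:
                # multinomial probability of this outcome under the null
                prob = factorial(n) / (factorial(r) * factorial(q) * factorial(w))
                for count, p in zip(outcome, proportions):
                    prob *= p ** count
                total += prob
    return total

# 5 red, 2 pink, 1 white against a 1:2:1 ratio
print(round(exact_multinomial_p([5, 2, 1], [0.25, 0.50, 0.25]), 4))  # 0.0596
```

With only 8 offspring there are just 45 possible outcomes, so the enumeration is instantaneous; it is only for large sample sizes that enumeration becomes impractical and randomization earns its keep.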
If some of your expected numbers are too small to use the chi-square or G-test, but your total sample size is too big for an exact test, you may have to use a randomization test.
How to do the test
I've put together a spreadsheet that will perform the randomization test of goodness-of-fit for up to 10 categories and up to 100 observations. It does 200 replicates at a time; to get a decent number of replicates, you should copy the numbers from the cell labelled "reps. with greater chi-sq." into the row labelled "enter reps. with greater chi-sq." By entering these numbers in this row alongside each other, you can do up to 10,000 total replicates.
Go to this web page and enter your observed numbers in the first column. If you had 5 red flowers, 2 pink flowers and 1 white flower in a genetic cross, you would enter those numbers. You may then either enter the expected numbers (2, 4 and 2) in the second column, or enter the expected proportions (0.25, 0.50, and 0.25) in the third column.
Hit the Calculate button, and you'll get the chi-square test results; in this case, the chi-square statistic is 6.00. Then scroll down and hit the 200 Random Samples button. The web page will then use a random-number generator to choose a flower at random. The probability of it choosing a red flower is 0.25, a pink flower 0.50, and a white flower 0.25. It will do this for 8 flowers, then calculate the chi-square statistic for this simulated data set. Then it will repeat this, for a total of 200 simulated data sets. You'll probably want to do more than 200 replicates, to get a more accurate estimate of the P-value; you should do at least 1000, by hitting the 200 Random Samples button four more times, or maybe 10,000 if you want publication-quality data.
To conduct a randomization test of goodness-of-fit in SAS, use PROC FREQ with the TABLES and EXACT statements and the CHISQ and MC options. Here's an example using the snapdragons. The testp=(25 50 25) option gives the expected percentages, which must add up to 100; you could also use the proportions, 0.25, 0.50 and 0.25. The order=data option tells SAS that the expected values are given for the values of "color" in the order they are input (red, pink, white).
data snapdragons;
   input color $ observed;
   cards;
red    5
pink   2
white  1
;
proc freq data=snapdragons order=data;
   weight observed;
   tables color / chisq testp=(25 50 25);
   exact chisq / mc n=100000;
run;
The output includes two P-values, one for the regular chi-square test and one for the randomization test.
Chi-Square Test
for Specified Proportions
--------------------------------------
Chi-Square                      6.0000
DF                                   2
Asymptotic Pr > ChiSq           0.0498   <- Chi-square P-value

Monte Carlo Estimate for the Exact Test
Pr >= ChiSq                     0.0594   <- Randomization P-value
99% Lower Conf Limit            0.0575
99% Upper Conf Limit            0.0613
Number of Samples               100000
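The 99% confidence limits reported for the Monte Carlo estimate reflect the binomial sampling error of the estimated proportion. You can check them with a couple of lines (my own snippet, using the normal approximation; SAS's method may differ in the last digit):

```python
from math import sqrt

p_hat, n_reps, z99 = 0.0594, 100000, 2.5758  # z for a 99% interval
half_width = z99 * sqrt(p_hat * (1 - p_hat) / n_reps)
print(round(p_hat - half_width, 4), round(p_hat + half_width, 4))  # 0.0575 0.0613
```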
I don't know how to do a precise power analysis for this test. Unless the numbers are very, very small, the P-values are fairly similar to those for a chi-square test of goodness-of-fit, so the power analysis described there should be adequate.
This page was last revised September 13, 2009. Its address is http://udel.edu/~mcdonald/statrand.html. It may be cited as pp. 52-56 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.
©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.