# Exact test for goodness-of-fit

The main goal of a statistical test is to answer the question, "What is the probability of getting a result like my observed data, if the null hypothesis were true?" If it is very unlikely to get the observed data under the null hypothesis, you reject the null hypothesis.

Most statistical tests take the following form:

1. Collect the data.
2. Calculate a number, the test statistic, that measures how far the observed data deviate from the expectation under the null hypothesis.
3. Use a mathematical function to estimate the probability of getting a test statistic as extreme as the one you observed, if the null hypothesis were true. This is the P-value.

Exact tests, such as the exact test for goodness-of-fit, are different. There is no test statistic; instead, the probability of obtaining the observed data under the null hypothesis is calculated directly. This is because the predictions of the null hypothesis are so simple that the probabilities can easily be calculated.

### When to use it

You use the exact binomial test when you have one nominal variable with only two values (such as male or female, left or right, green or yellow). The observed data are compared with the expected data, which are some kind of theoretical expectation (such as a 1:1 sex ratio or a 3:1 ratio in a genetic cross) that is determined before the data are collected. If the total number of observations is too high (around a thousand), computers may not be able to do the calculations for the exact test, and a G-test or chi-square test of goodness-of-fit must be used instead (and will give almost exactly the same result).

You can do exact multinomial tests of goodness-of-fit when the nominal variable has more than two values. The basic concepts are the same as for the exact binomial test. Here I'm limiting the explanation to the binomial test, because it's more commonly used and easier to understand.

### Null hypothesis

For a two-tailed test, which is what you almost always should use, the null hypothesis is that the number of observations in each category is equal to that predicted by a biological theory, and the alternative hypothesis is that the observed data are different from the expected. For example, if you do a genetic cross in which you expect a 3:1 ratio of green to yellow pea pods, and you have a total of 50 plants, your null hypothesis is that there are 37.5 plants with green pods and 12.5 with yellow pods.

If you are doing a one-tailed test, the null hypothesis is that the observed number for one category is equal to or less than the expected; the alternative hypothesis is that the observed number in that category is greater than expected.

### How the test works

 Gus the cat.

Let's say you want to know whether our cat, Gus, has a preference for one paw or uses both paws equally. You dangle a ribbon in his face and record which paw he uses to bat at it. You do this 10 times, and he bats at the ribbon with his right paw 8 times and his left paw 2 times. Then he gets bored with the experiment and leaves. Can you conclude that he is right-pawed, or could this result have occurred due to chance under the null hypothesis that he bats equally with each paw?

The null hypothesis is that each time Gus bats at the ribbon, the probability that he will use his right paw is 0.5. The probability that he will use his right paw on the first time is 0.5. The probability that he will use his right paw the first time AND the second time is 0.5 × 0.5, or $0.5^{2}$, or 0.25. The probability that he will use his right paw all ten times is $0.5^{10}$, or about 0.001.

For a mixture of right and left paws, the calculation is more complicated. Where n is the total number of trials, k is the number of "successes" (statistical jargon for whichever event you want to consider), p is the expected proportion of successes if the null hypothesis is true, and Y is the probability of getting k successes in n trials, the equation is:

$$
Y = \frac{p^{k}\,(1-p)^{n-k}\,n!}{k!\,(n-k)!}
$$

Fortunately, there's a spreadsheet function that does the calculation for you. To calculate the probability of getting exactly 8 out of 10 right paws, you would enter

=BINOMDIST(2, 10, 0.5, FALSE)

The first number, 2, is whichever event there are fewer than expected of; in this case, there are only two uses of the left paw, which is fewer than the expected 5. The second number is the total number of trials. The third number is the expected proportion of whichever event there were fewer than expected of. And FALSE tells it to calculate the exact probability for that number of events only. In this case, the answer is P=0.044, so you might think it was significant at the P<0.05 level.
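
If you'd rather check the spreadsheet than trust it, the equation above can be evaluated directly. This is a minimal Python sketch using only the standard library; the function name `binomial_probability` is my own label for illustration, not something from a statistics package.

```python
from math import comb

def binomial_probability(k, n, p):
    """Probability of exactly k successes in n trials, each with success probability p."""
    # comb(n, k) is n! / (k! (n-k)!), the number of orderings that give k successes.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Gus used his left paw 2 times out of 10; the null proportion is 0.5.
print(round(binomial_probability(2, 10, 0.5), 3))  # 0.044
```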

However, it would be incorrect to only calculate the probability of getting exactly 2 left paws and 8 right paws. Instead, you must calculate the probability of getting a deviation from the null expectation as large as, or larger than, the observed result. So you must calculate the probability that Gus used his left paw 2 times out of 10, or 1 time out of 10, or 0 times out of ten. Adding these probabilities together gives P=0.055, which is not quite significant at the P<0.05 level. You do this in a spreadsheet by entering

=BINOMDIST(2, 10, 0.5, TRUE).

The "TRUE" parameter tells the spreadsheet to calculate the sum of the probabilities of the observed number and all more extreme values; it's the equivalent of

=BINOMDIST(2, 10, 0.5, FALSE)+BINOMDIST(1, 10, 0.5, FALSE)+BINOMDIST(0, 10, 0.5, FALSE).

There's one more thing. The above calculation gives the total probability of getting 2, 1, or 0 uses of the left paw out of 10. However, the alternative hypothesis is that the number of uses of the right paw is not equal to the number of uses of the left paw. If there had been 2, 1, or 0 uses of the right paw, that also would have been an equally extreme deviation from the expectation. So you must add the probability of getting 2, 1, or 0 uses of the right paw, to account for both tails of the probability distribution; you are doing a two-tailed test. This gives you P=0.109, which is not very close to being significant. (If the null hypothesis had been that the proportion of left-paw uses is 0.50 or more, and the alternative hypothesis had been that it is less than 0.50, you could do a one-tailed test and use P=0.055. But you almost never have a situation where a one-tailed test is appropriate.)
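
The two tails can also be added up by hand. The following Python sketch (standard library only, with an illustrative helper name) sums the probabilities for each tail of the Gus data separately and then combines them.

```python
from math import comb

def binomial_probability(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5

# One tail: 2, 1, or 0 uses of the left paw.
lower_tail = sum(binomial_probability(k, n, p) for k in range(0, 3))

# Other tail: 2, 1, or 0 uses of the right paw, i.e. 8, 9, or 10 uses of the left paw.
upper_tail = sum(binomial_probability(k, n, p) for k in range(8, 11))

two_tailed = lower_tail + upper_tail

print(round(lower_tail, 3))  # 0.055
print(round(two_tailed, 3))  # 0.109
```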

 Graph showing the probability distribution for the binomial with 10 trials.

The most common use of an exact binomial test is when the null hypothesis is that numbers of the two outcomes are equal. In that case, the meaning of a two-tailed test is clear, and the two-tailed P-value is found by multiplying the one-tailed P-value by two.

When the null hypothesis is not a 1:1 ratio, but something like a 3:1 ratio, the meaning of a two-tailed exact binomial test is not agreed upon; different statisticians, and different statistical programs, have slightly different interpretations and sometimes give different results for the same data. My spreadsheet adds the probabilities of all possible outcomes that are less likely than the observed numbers; this method of small P-values is preferred by most statisticians.
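
Here is one way the method of small P-values could be sketched in Python. This is my reading of the description above, not the spreadsheet's actual formulas: the function name, the decision to count ties with the observed outcome, and the small floating-point tolerance are all assumptions.

```python
from math import comb

def exact_binomial_small_p(k_observed, n, p):
    """Two-tailed P-value: add up the probabilities of every outcome that is
    no more likely than the observed one (method of small P-values)."""
    probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    # Assumed tolerance so that outcomes tied with the observed one are not
    # dropped because of floating-point rounding.
    cutoff = probs[k_observed] * (1 + 1e-7)
    return sum(q for q in probs if q <= cutoff)

# With a 1:1 expectation this agrees with doubling one tail (Gus data).
print(round(exact_binomial_small_p(2, 10, 0.5), 3))  # 0.109

# With a 3:1 expectation the two tails are no longer mirror images.
# The counts here (10 out of 20) are made up for illustration.
print(exact_binomial_small_p(10, 20, 0.75))
```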

### Examples

Mendel crossed pea plants that were heterozygotes for green pod/yellow pod; pod color is the nominal variable, with "green" and "yellow" as the values. If this is inherited as a simple Mendelian trait, with green dominant over yellow, the expected ratio in the offspring is 3 green: 1 yellow. He observed 428 green and 152 yellow. The expected numbers of plants under the null hypothesis are 435 green and 145 yellow, so Mendel observed slightly fewer green-pod plants than expected. The P-value for an exact binomial test is 0.533 (using the method of small P-values), indicating that the null hypothesis cannot be rejected; there is no significant difference between the observed and expected frequencies of pea plants with green pods.

 Female Roptrocerus xylophagorum.

Roptrocerus xylophagorum is a parasitoid of bark beetles. To determine what cues these wasps use to find the beetles, Sullivan et al. (2000) placed female wasps in the base of a Y-shaped tube, with a different odor in each arm of the Y, then counted the number of wasps that entered each arm of the tube. In one experiment, one arm of the Y had the odor of bark being eaten by adult beetles, while the other arm of the Y had bark being eaten by larval beetles. Ten wasps entered the area with the adult beetles, while 17 entered the area with the larval beetles. The difference from the expected 1:1 ratio is not significant (P=0.248). In another experiment that compared infested bark with a mixture of infested and uninfested bark, 36 wasps moved towards the infested bark, while only 7 moved towards the mixture; this is significantly different from the expected ratio (P=$9 \times 10^{-6}$).

Yukilevich and True (2008) mixed 30 male and 30 female Drosophila melanogaster from Alabama with 30 male and 30 females from Grand Bahama Island. They observed 246 matings; 140 were homotypic (male and female from the same location), while 106 were heterotypic (male and female from different locations). The null hypothesis is that the flies mate at random, so that there should be equal numbers of homotypic and heterotypic matings. There were significantly more homotypic matings (exact binomial test, P=0.035) than heterotypic.
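
The wasp and fly examples all have a 1:1 null expectation, so their P-values can be checked by doubling one tail. This Python sketch uses only the standard library; the function name is illustrative, and capping the result at 1 is my own choice for the case where the two counts are equal.

```python
from math import comb

def exact_binomial_equal_ratio(count_a, count_b):
    """Two-tailed exact binomial P-value against a 1:1 expectation,
    found by doubling the smaller tail (capped at 1)."""
    n = count_a + count_b
    smaller = min(count_a, count_b)
    # Under a 1:1 null every sequence of n outcomes has probability 1 / 2**n.
    one_tail = sum(comb(n, k) for k in range(smaller + 1)) / 2**n
    return min(1.0, 2 * one_tail)

print(exact_binomial_equal_ratio(10, 17))    # wasps: adult vs. larval beetle odor
print(exact_binomial_equal_ratio(36, 7))     # wasps: infested bark vs. mixture
print(exact_binomial_equal_ratio(140, 106))  # flies: homotypic vs. heterotypic
```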

### Graphing the results

You plot the results of an exact test the same way you would any other goodness-of-fit test.

### Similar tests

A G-test or chi-square goodness-of-fit test could also be used for the same data as the exact test of goodness-of-fit. Where the expected numbers are small, the exact test will give more accurate results than the G-test or chi-square test. Where the sample size is large (over a thousand), attempting to use the exact test may give error messages (computers have a hard time calculating factorials for large numbers), so a G-test or chi-square test must be used. For intermediate sample sizes, all three tests give approximately the same results. I recommend that you use the exact test when n is less than 1000; see the web page on small sample sizes for further discussion.
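
To see how much the three tests can disagree when the expected numbers are small, here is a rough Python comparison on the Gus data (8 right, 2 left). It is only a sketch: the chi-square and G statistics are computed without any continuity correction, and the helper `chi2_p_value_1df` relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
from math import comb, erfc, log, sqrt

observed = [8, 2]      # Gus: right paw, left paw
expected = [5.0, 5.0]  # 1:1 expectation for 10 trials

# Chi-square and G statistics for goodness-of-fit.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
g_stat = 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

def chi2_p_value_1df(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))

# Exact two-tailed P-value for a 1:1 expectation (doubling the smaller tail).
n, smaller = sum(observed), min(observed)
exact_p = min(1.0, 2 * sum(comb(n, k) for k in range(smaller + 1)) / 2**n)

print(f"exact      P = {exact_p:.3f}")
print(f"chi-square P = {chi2_p_value_1df(chi_square):.3f}")
print(f"G-test     P = {chi2_p_value_1df(g_stat):.3f}")
```

With only 10 observations, both approximations come out noticeably smaller than the exact P-value of 0.109.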

The exact test and randomization test should give you the same result, if you do enough replicates for the randomization test, so the choice between them is a matter of personal preference. The exact test sounds more "exact"; the randomization test may be easier to understand and explain.

The sign test is a particular application of the exact binomial test. It is usually used when observations of a measurement variable are made in pairs (such as right-vs.-left or before-vs.-after), and only the direction of the difference, not the size of the difference, is of biological interest.
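
As a sketch of how the sign test reduces to an exact binomial test, the Python below counts the direction of each paired difference and tests the counts against a 1:1 expectation. The measurements are invented for illustration, and dropping tied pairs is an assumption on my part (a common convention, but not one stated above).

```python
from math import comb

def sign_test(before, after):
    """Two-tailed sign test: exact binomial test on the direction of each paired difference."""
    differences = [a - b for b, a in zip(before, after)]
    increases = sum(d > 0 for d in differences)
    decreases = sum(d < 0 for d in differences)
    n = increases + decreases  # ties carry no direction and are dropped
    if n == 0:
        return 1.0
    smaller = min(increases, decreases)
    one_tail = sum(comb(n, k) for k in range(smaller + 1)) / 2**n
    return min(1.0, 2 * one_tail)

# Hypothetical paired measurements, invented for illustration only.
before = [12.1, 9.8, 11.4, 10.0, 13.2, 9.5, 10.7, 11.9, 10.3, 12.6]
after = [12.9, 10.4, 11.1, 10.8, 13.9, 10.2, 11.5, 12.4, 11.0, 13.1]

# Nine pairs increase and one decreases.
print(round(sign_test(before, after), 3))  # 0.021
```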

The exact test for goodness-of-fit is not the same as Fisher's exact test of independence. A test of independence is used for two nominal variables, such as sex and location. If you wanted to compare the ratio of male to female students at Delaware to the male:female ratio at Maryland, you would use a test of independence; if you want to compare the male:female ratio at Delaware to a theoretical 1:1 ratio, you would use a goodness-of-fit test.

### How to do the test

I have set up a spreadsheet that performs the exact binomial test for sample sizes up to 1000. It is self-explanatory.

#### Web page

Richard Lowry has set up a web page that does the exact binomial test. I'm not aware of any web pages that will do exact multinomial tests.

#### SAS

Here is a sample SAS program, showing how to do the exact binomial test on the Gus data. The p=0.5 gives the expected proportion of whichever value of the nominal variable is alphabetically first; in this case, it gives the expected proportion of "left."

The SAS exact binomial function finds the two-tailed P-value by doubling the P-value of one tail. The binomial distribution is not symmetrical when the expected proportion is other than 50 percent, so the technique SAS uses isn't as good as the method of small P-values. I don't recommend doing the exact binomial test in SAS when the expected proportion is anything other than 50 percent.

```
data gus;
input paw $;
cards;
right
left
right
right
right
right
left
right
right
right
;
proc freq data=gus;
tables paw / binomial(p=0.5);
exact binomial;
run;

```

Near the end of the output is this:

```
Exact Test
One-sided Pr <=  P           0.0547
Two-sided = 2 * One-sided    0.1094

```

The "Two-sided=2*One-sided" number is the two-tailed P-value that you want.

If you have the total numbers, rather than the raw values, you'd use a "weight" parameter in PROC FREQ. The zeros option tells it to include observations with counts of zero, for example if Gus had used his left paw 0 times; it doesn't hurt to always include the zeros option.

```
data gus;
input paw $ count;
cards;
right 8
left 2
;
proc freq data=gus;
weight count / zeros;
tables paw / binomial(p=0.5);
exact binomial;
run;

```

This example shows how to do the exact multinomial test. The numbers are Mendel's data from a genetic cross in which you expect a 9:3:3:1 ratio of peas that are round+yellow, round+green, wrinkled+yellow, and wrinkled+green. The order=data option tells SAS to analyze the data in the order they are input (rndyel, rndgrn, wrnkyel, wrnkgrn, in this case), not alphabetical order. The testp=(0.5625 0.1875 0.1875 0.0625) lists the expected proportions in the same order.

```
data peas;
input color $ count;
cards;
rndyel 315
rndgrn 108
wrnkyel 101
wrnkgrn 32
;
proc freq data=peas order=data;
weight count / zeros;
tables color / chisq testp=(0.5625 0.1875 0.1875 0.0625);
exact chisq;
run;

```

The P-value you want is labelled "Exact Pr >= ChiSq":

```
Chi-Square Test
for Specified Proportions
-------------------------------------
Chi-Square                     0.4700
DF                                  3
Asymptotic Pr >  ChiSq         0.9254
Exact      Pr >= ChiSq         0.9272

```
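
If you don't have SAS, the idea behind an exact multinomial test can be sketched in Python for small samples. This is not what SAS does above (SAS ranks outcomes by their chi-square value); the sketch instead extends the method of small P-values to more than two categories, which is an assumption on my part. It enumerates every possible outcome, so it is only practical for small n and few categories; Mendel's 556 peas would take far too long. The counts in the example are made up.

```python
from math import factorial

def compositions(total, parts):
    """Yield every way to split `total` observations among `parts` categories."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def multinomial_probability(counts, proportions):
    """Probability of one particular set of counts under the expected proportions."""
    coefficient = factorial(sum(counts))
    probability = 1.0
    for count, proportion in zip(counts, proportions):
        coefficient //= factorial(count)
        probability *= proportion ** count
    return coefficient * probability

def exact_multinomial_test(observed, proportions):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    # Assumed tolerance so that floating-point ties are counted.
    cutoff = multinomial_probability(observed, proportions) * (1 + 1e-7)
    total = 0.0
    for counts in compositions(sum(observed), len(observed)):
        prob = multinomial_probability(counts, proportions)
        if prob <= cutoff:
            total += prob
    return total

# Hypothetical cross with an expected 1:2:1 ratio and 12 offspring.
print(exact_multinomial_test((1, 8, 3), (0.25, 0.5, 0.25)))
```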

### Power analysis

For the exact binomial test, you can do the power analysis with the form at http://udel.edu/~mcdonald/statexactbin.html. Enter the probability of one of the two outcomes under the null hypothesis; the probability under the alternative hypothesis; the significance level of the test (you will almost always use 0.05); and the power (0.80, 0.90, and 0.50 are common values). You should almost always use the two-tailed test.

As an example, let's say I wanted to do an experiment to see if Gus the cat really did use one paw more than the other for getting my attention. The null hypothesis is that the probability that he uses his left paw is 0.50. I decide that if the probability of him using his left paw is 0.40, I want my experiment to have an 80% probability of getting a significant (P<0.05) result. If he uses his left paw 60% of the time, I'll accept that as a significant result too, so it's a two-tailed test. Entering 0.50, 0.40, 0.05, and 0.80 in the boxes, and choosing two-tailed test, the result is 210. This means that if Gus really is using his left paw 40% (or 60%) of the time, a sample size of 210 observations will have an 80% probability of giving me a significant (P<0.05) exact binomial test.

Note that if the null expectation is not 0.50, you will get different results, depending on whether you make the alternative proportion smaller or larger than the expected. For example, if you do a genetic cross in which 25% of the offspring are expected to be yellow, it would take 142 observations to give you 80% power if the alternative hypothesis is 15% yellow, but 174 observations if the alternative hypothesis is 35% yellow. In this situation, you should probably use the larger number.

This form uses exact probability calculations for sample sizes less than 500, and the normal approximation for larger sample sizes. Techniques used by other programs may give somewhat different results (Chernick and Liu 2002). For example, G*power gives 195 as the sample size needed for the Gus paw experiment, not 210 as shown by my form. Because both power and effect size are usually just arbitrary round numbers, where it would be easy to justify other values that would change the required sample size, small differences in the method used to calculate desired sample size are probably not very important.
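
To get a feel for what the power calculation involves, here is a Python sketch that computes the power of a two-tailed exact binomial test for a given sample size. It is not the algorithm behind my form; it assumes the two-tailed P-value comes from the method of small P-values and that "significant" means P strictly less than alpha, so its numbers may differ somewhat from the form's or from other programs'.

```python
from math import comb

def exact_binomial_power(n, p_null, p_alt, alpha=0.05):
    """Power of a two-tailed exact binomial test with n observations
    when the true proportion is p_alt."""
    def pmf(p):
        return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

    null_probs = pmf(p_null)
    alt_probs = pmf(p_alt)
    power = 0.0
    for k in range(n + 1):
        # Two-tailed P-value for outcome k under the null hypothesis.
        cutoff = null_probs[k] * (1 + 1e-7)
        p_value = sum(q for q in null_probs if q <= cutoff)
        # Every outcome that would be significant contributes its probability
        # under the alternative hypothesis to the power.
        if p_value < alpha:
            power += alt_probs[k]
    return power

# Gus example: null proportion 0.50, alternative 0.40, 210 observations.
print(exact_binomial_power(210, 0.50, 0.40))
```

Because the number of significant outcomes changes in steps as n grows, power does not rise smoothly with sample size, which is one reason different programs report different required sample sizes.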


If your nominal variable has more than two values, use this power and sample size page. It is designed for chi-square tests, not exact tests, but the sample sizes will be very close. Choose "Generic chi-square test" from the box on the left side of the page (if you don't see the list of tests, make sure your web browser has Java turned on). Under "Prototype data," enter the chi-square value and sample size for some fake data. For example, if you're doing a genetic cross with an expected 1:2:1 ratio, and your minimum effect size is 10 percent more heterozygotes than expected, use the chi-square spreadsheet to do a chi-square test on observed numbers of 20:60:20 compared to expected proportions of 1:2:1. The spreadsheet gives you a chi-square value of 4.00 and an n of 100, which you enter under "Prototype data". Then set d (the degrees of freedom) equal to 2, and leave alpha at 0.05. The sliders can then be slid back and forth to yield the desired result. For example, if you slide the Power to 0.90, n is equal to 316. Note that the absolute values of the prototype data don't matter, only their relative relationship; you could have used 200:600:200, which would give you a chi-square value of 40.0 and an n of 1000, and gotten the exact same result.
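
If you don't have the chi-square spreadsheet handy, the prototype chi-square value is easy to compute. This Python sketch (illustrative function name, standard library only) reproduces the numbers in the 1:2:1 example and shows that the chi-square value per observation is the same for 20:60:20 and 200:600:200, which fits the remark that only the relative values of the prototype data matter.

```python
def prototype_chi_square(observed, expected_ratio):
    """Chi-square statistic and sample size for observed counts vs. an expected ratio."""
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    expected = [n * r / ratio_total for r in expected_ratio]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_square, n

for counts in ([20, 60, 20], [200, 600, 200]):
    chi_square, n = prototype_chi_square(counts, [1, 2, 1])
    print(f"chi-square = {chi_square:.2f}, n = {n}, chi-square/n = {chi_square / n:.2f}")
# chi-square = 4.00, n = 100, chi-square/n = 0.04
# chi-square = 40.00, n = 1000, chi-square/n = 0.04
```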

### Further reading

Sokal and Rohlf, pp. 686-687.

Zar, pp. 533-538.

### References

Picture of wasp from Universal Chalcidoidea Database.

Chernick, M.R., and C.Y. Liu. 2002. The saw-toothed behavior of power versus sample size and software solutions: single binomial proportion using exact methods. Amer. Stat. 56: 149-155.

Mendel, G. 1865. Experiments in plant hybridization. Available at MendelWeb.

Sullivan, B.T., E.M. Pettersson, K.C. Seltmann, and C.W. Berisford. 2000. Attraction of the bark beetle parasitoid Roptrocerus xylophagorum (Hymenoptera: Pteromalidae) to host-associated olfactory cues. Env. Entom. 29: 1138-1151.

Yukilevich, R., and J.R. True. 2008. Incipient sexual isolation among cosmopolitan Drosophila melanogaster populations. Evolution 62: 2112-2121.