When to use it
Sometimes you'll do a goodness-of-fit experiment more than once; for example, you might look at the fit to a 3:1 ratio of a genetic cross in more than one family, or fit to a 1:1 sex ratio in more than one population, or fit to a 1:1 ratio of broken right and left ankles on more than one sports team. One question then is, should you analyze each experiment separately, accepting the risk that the small sample sizes will have insufficient power? Or should you pool all the data, accepting the risk that the different experiments gave different results? This is when the additive property of the G-test of goodness-of-fit becomes important, because you can do a replicated G-test of goodness-of-fit.
You use the replicated G-test of goodness-of-fit when you have two nominal variables with two or more values (such as red vs. pink vs. white flowers for one variable), one of the nominal variables represents different replicates of the same experiment (different days, different locations, different pairs of parents), and the observed data are compared with an extrinsic theoretical expectation (such as an expected 1:2:1 ratio in a genetic cross). I do not know if this analysis would be appropriate with an intrinsic hypothesis, such as the p²: 2pq: q² Hardy-Weinberg proportions of population genetics.
This technique tests four null hypotheses. The first statistical null hypothesis is that the number of observations in each category is equal to that predicted by a biological theory, and the alternative hypothesis is that the observed numbers are different from the expected. This is the same null hypothesis as for a regular G-test of goodness-of-fit. This is tested for each individual experiment. The second null hypothesis is that overall, the data from the individual experiments fit the expectations. This null hypothesis is a bit difficult to grasp, but being able to test it is the main value of doing a replicated G-test of goodness-of-fit. The third null hypothesis is that the relative proportions are the same across the different experiments; this is the same as the null hypothesis for a G-test of independence. The fourth null hypothesis is that the pooled data set fits the expected proportions.
How to do the test
First, do a G-test of goodness-of-fit for each individual data set. The resulting G-values are the "individual G-values." Also record the number of degrees of freedom for each individual data set; these are the "individual degrees of freedom." Even if nothing else is significant, it is interesting if one or more of these tests are significant.
(Note: Some programs use "continuity corrections," such as the Yates correction or the Williams correction, in an attempt to make G-tests more accurate for small sample sizes. Do not use any continuity corrections when doing a replicated G-test, or the G-values will not add up properly. My spreadsheet for G-tests of goodness-of-fit does not use any continuity corrections.)
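As a rough illustration of this first step, the Python sketch below computes an individual G-value with no continuity correction. The function name `g_statistic` and the counts are made up for the example; the only thing taken as given is the usual formula G = 2 Σ O ln(O/E).

```python
import math

def g_statistic(observed, expected_props):
    """G-value for a goodness-of-fit test, with no continuity correction."""
    n = sum(observed)
    g = 0.0
    for o, p in zip(observed, expected_props):
        if o > 0:  # a class with zero observations adds nothing to the sum
            g += o * math.log(o / (n * p))
    return 2.0 * g

# Hypothetical counts of red, pink, and white offspring from one cross,
# tested against an expected 1:2:1 ratio.
observed = [30, 50, 20]
g = g_statistic(observed, [0.25, 0.50, 0.25])
df = len(observed) - 1  # individual degrees of freedom: classes minus one
print(round(g, 2), df)  # 2.01 2
```

Repeating this for each replicate gives the individual G-values and individual degrees of freedom.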
Next, add up all the individual G-values to get the "total G-value", and add up the individual degrees of freedom to get the "total degrees of freedom." Use the CHIDIST function in a spreadsheet or online chi-square calculator to find the P value for the total G-value with the total degrees of freedom. For example, if your total G-value is 13.43 and your total degrees of freedom is 6, enter "=CHIDIST(13.43, 6)". The result will be the P-value for the total G; in this case, 0.0367. If it is significant, you can reject one null hypothesis, that all of the data from the different experiments fit the expected ratio, but you cannot tell yet in what way the data are inconsistent with the expected ratio.
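If you are working in Python rather than a spreadsheet, the upper-tail chi-square probability that CHIDIST returns can be sketched with the standard library alone. The helper name `chi2_sf` is an assumption made for this example, and it handles only whole-number degrees of freedom; in practice a statistics library would normally do this job.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q = math.erfc(math.sqrt(half))  # exact result for df = 1
        a = 0.5
    else:
        q = math.exp(-half)             # exact result for df = 2
        a = 1.0
    # Step up two degrees of freedom at a time using
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Same inputs as the CHIDIST(13.43, 6) call above.
print(round(chi2_sf(13.43, 6), 4))  # 0.0367
```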
Next, add up the number of observations in each class. For the genetic cross example, you would add up the number of red flowers in all the crosses, all the pink flowers, and all the white flowers. Do a G-test of goodness-of-fit on this pooled data set. This gives you the "pooled G-value." The degrees of freedom is the "pooled degrees of freedom," and it is just the number of classes minus one (the same as for a regular goodness-of-fit test). Find the P-value using the CHIDIST function. The P-value for this test tells you whether the pooled data set deviates significantly from the expected ratio.
Finally, subtract the pooled G-value from the total G-value, and subtract the pooled degrees of freedom from the total degrees of freedom. This gives you the "heterogeneity G-value" and "heterogeneity degrees of freedom." Find the P-value using the CHIDIST function. If the heterogeneity G-value is significant, it means that the individual data sets have significantly different ratios from each other. The heterogeneity G is the same as a G-test of independence comparing the different ratios.
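The steps so far can be sketched in Python as follows. The function names and the counts for three crosses are invented for illustration, so treat this as one possible layout rather than a reference implementation. The last lines check the statement that the heterogeneity G obtained by subtraction matches a G-test of independence computed directly from the table.

```python
import math

def g_statistic(observed, expected_props):
    """G-value for a goodness-of-fit test, with no continuity correction."""
    n = sum(observed)
    g = 0.0
    for o, p in zip(observed, expected_props):
        if o > 0:  # a zero count contributes nothing to the sum
            g += o * math.log(o / (n * p))
    return 2.0 * g

def replicated_g(replicates, expected_props):
    """Split the total G into pooled and heterogeneity parts as (G, df) pairs."""
    classes = len(expected_props)
    individual = [g_statistic(rep, expected_props) for rep in replicates]
    total_g = sum(individual)
    total_df = len(replicates) * (classes - 1)
    pooled_counts = [sum(col) for col in zip(*replicates)]
    pooled_g = g_statistic(pooled_counts, expected_props)
    pooled_df = classes - 1
    return {
        "individual": individual,
        "total": (total_g, total_df),
        "pooled": (pooled_g, pooled_df),
        "heterogeneity": (total_g - pooled_g, total_df - pooled_df),
    }

def g_independence(table):
    """G-test of independence on a replicates-by-classes table of counts."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    g = 0.0
    for row in table:
        row_total = sum(row)
        for o, c in zip(row, col_totals):
            if o > 0:
                g += o * math.log(o * n / (row_total * c))
    return 2.0 * g

# Made-up counts (red, pink, white) for three hypothetical crosses.
crosses = [[30, 50, 20], [22, 55, 23], [35, 48, 17]]
result = replicated_g(crosses, [0.25, 0.50, 0.25])
het_g, het_df = result["heterogeneity"]
print(result["total"], result["pooled"], result["heterogeneity"])
print(abs(het_g - g_independence(crosses)) < 1e-9)  # True
```

Each (G, df) pair would then go through CHIDIST, or an equivalent chi-square tail function, to get its P-value.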
If the heterogeneity G-value is not significant, you can accept one null hypothesis (that the replicates have the same ratios), pool the data, and treat them as if they came from one big experiment. Then you can use the pooled G-value to test the null hypothesis that the data fit the expected ratio.
However, if the heterogeneity G-value is significant, you reject the null hypothesis that the replicates have the same ratios. This means that you cannot pool the data and use the pooled G-value to test anything; you shouldn't pool data sets that are significantly different from each other. In this case, you would investigate the individual data sets more thoroughly, starting by looking at the significance of the individual G-values and then using more sophisticated methods that aren't described here (see Sokal and Rohlf 1995, pp. 722-724).
It won't happen very often, but it's possible that neither the heterogeneity G-value nor the pooled G-value will be significant, but the total G-value will. This rather frustrating result would mean that you could reject the hypothesis that the data all fit the theoretical expectation, but you wouldn't know whether it was due to heterogeneity among the data sets, an overall deviation, or some combination of the two.
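The decision rules in the last three paragraphs can be written out as a small Python function. This is only a sketch: the function name, the wording of the messages, the 0.05 significance threshold, and the P-values passed in are all assumptions made for the example.

```python
def interpret(p_total, p_pooled, p_het, alpha=0.05):
    """Turn the three P-values into a short verbal conclusion.

    alpha = 0.05 is an assumed significance threshold.
    """
    if p_het < alpha:
        # Heterogeneity is checked first: pooling is not valid here.
        return "replicates differ; do not pool, examine the individual G-values"
    if p_pooled < alpha:
        return "replicates agree; pooled data deviate from the expected ratio"
    if p_total < alpha:
        # Total is significant but neither component is.
        return "some deviation exists, but its source cannot be identified"
    return "no significant deviation from the expected ratio"

# Hypothetical P-values, in the order total, pooled, heterogeneity.
print(interpret(0.030, 0.200, 0.010))
print(interpret(0.020, 0.004, 0.400))
print(interpret(0.040, 0.090, 0.080))
```

Checking heterogeneity before the pooled result mirrors the order of reasoning in the text: the pooled G-value is only used once the replicates have been judged similar enough to pool.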
The imaginary data set shown below is the result of eight crosses of heterozygous red/white flowers. The expected proportions of offspring are 0.25 red, 0.50 pink, and 0.25 white. The two nominal variables are color (red, pink, or white) and which cross it is.
The total G-value (32.63), found by summing the eight individual G-values, is significant (P=0.008). This means that the data do not fit the expected 1:2:1 proportions in some way. The pooled G-value (7.89), found by doing a G-test of goodness-of-fit on the pooled data (230 red, 430 pink, 176 white), is significant (P=0.019), which might suggest that there is an overall deviation from the expected proportions. However, the heterogeneity G-value (24.74), found by subtracting the pooled G-value from the total G-value, is also significant (P=0.037). This means that the eight crosses were significantly different from each other in their ratios of red to pink to white, so it would be incorrect to use the pooled G-value for any hypothesis test.
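The arithmetic in this example can be checked with a short Python sketch. The pooled counts and the total G-value come from the paragraph above; the degrees of freedom are inferred from eight crosses and three color classes, and the helper names `g_statistic` and `chi2_sf` are assumptions made for the example.

```python
import math

def g_statistic(observed, expected_props):
    """G-value for a goodness-of-fit test, with no continuity correction."""
    n = sum(observed)
    return 2.0 * sum(o * math.log(o / (n * p))
                     for o, p in zip(observed, expected_props) if o > 0)

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    while a < df / 2.0:  # step up two degrees of freedom at a time
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

pooled_g = g_statistic([230, 430, 176], [0.25, 0.50, 0.25])
total_g = 32.63                        # sum of the eight individual G-values, as reported
het_g = total_g - round(pooled_g, 2)   # subtraction done on the rounded values, as in the text

total_df = 8 * (3 - 1)                 # eight crosses, three color classes
pooled_df = 3 - 1
het_df = total_df - pooled_df

print(round(pooled_g, 2), round(chi2_sf(pooled_g, pooled_df), 3))  # 7.89 0.019
print(round(het_g, 2), round(chi2_sf(het_g, het_df), 3))           # 24.74 0.037
print(round(chi2_sf(total_g, total_df), 3))                        # 0.008
```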
Connallon and Jakubowski (2009) performed mating competitions among male Drosophila melanogaster. They took the "unpreferred" males that had lost three competitions in a row and mated them with females, then looked at the sex ratio of the offspring. They did this for three separate sets of flies.
The total G-value is significant, so we can reject the null hypothesis that all three trials fit a 1:1 sex ratio. The heterogeneity G-value is not significant; although the results of the second trial may look quite different from the results of the first and third trials, the three trials are not significantly different. We can therefore look at the pooled G-value. It is significant; the unpreferred males have significantly more daughters than sons.
For replicated goodness-of-fit tests, you must use the G-test, not the chi-squared test. Chi-squared values, although they are similar to G-values, do not add up the same way; the heterogeneity chi-square plus the pooled chi-square does not equal the total of the individual chi-squares. You could do a chi-squared test of independence among the replicates, then if that is not significant, pool the data and do a chi-squared goodness-of-fit test. However, you would not be able to detect the kind of overall deviation that the total G-value tests for.
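To see the lack of additivity concretely, the Python sketch below computes the three chi-squared quantities for a made-up table of three crosses tested against a 1:2:1 ratio. The function names and counts are invented for the example.

```python
def chi_square_gof(observed, expected_props):
    """Pearson chi-squared statistic for a goodness-of-fit test."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_props))

def chi_square_independence(table):
    """Pearson chi-squared statistic for a test of independence."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for row in table:
        row_total = sum(row)
        for o, c in zip(row, col_totals):
            e = row_total * c / n  # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2

# Made-up counts (red, pink, white) for three hypothetical crosses.
crosses = [[30, 50, 20], [22, 55, 23], [35, 48, 17]]
props = [0.25, 0.50, 0.25]

total = sum(chi_square_gof(rep, props) for rep in crosses)
pooled = chi_square_gof([sum(col) for col in zip(*crosses)], props)
het = chi_square_independence(crosses)

print(round(total, 2), round(pooled + het, 2))  # 9.66 9.36
```

The two printed numbers differ, so the heterogeneity and pooled chi-squares cannot be treated as a partition of the total the way the corresponding G-values can.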
Sokal and Rohlf, pp. 715-724.
Connallon, T., and E. Jakubowski. 2009. Association between sex ratio distortion and sexually antagonistic fitness consequences of female choice. Evolution 63: 2179-2183.
This page was last revised September 11, 2009. Its address is http://udel.edu/~mcdonald/statrepgtestgof.html. It may be cited as pp. 84-87 in:
McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.
©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.