One-way anova: Partitioning variance components

In a Model II anova with significant heterogeneity among the means, the next step is to partition the variance into among-group and within-group components. Under the null hypothesis of homogeneity of means, the among-group mean square and within-group mean square are both estimates of the within-group parametric variance. If the means are heterogeneous, the within-group mean square is still an estimate of the within-group variance, but the among-group mean square estimates the sum of the within-group variance plus the group sample size times the added variance among groups. Therefore subtracting the within-group mean square from the among-group mean square, and dividing this difference by the average group sample size, gives an estimate of the added variance component among groups. The equation is:

$$\text{among-group variance} = \frac{MS_{\text{among}} - MS_{\text{within}}}{n_o}$$

where $n_o$ is a number that is close to, but usually slightly less than, the arithmetic mean of the sample size ($n_i$) of each of the $a$ groups:

$$n_o = \frac{1}{a-1}\left(\sum n_i - \frac{\sum n_i^2}{\sum n_i}\right)$$
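Here is a minimal Python sketch of the two formulas above. The function names `n_o` and `among_group_variance` and the example sample sizes are my own choices for illustration; they don't come from any particular package or data set.

```python
def n_o(sample_sizes):
    """Return n_o, the sample-size term used to estimate the among-group component."""
    a = len(sample_sizes)                      # number of groups
    total = sum(sample_sizes)                  # sum of n_i
    sum_sq = sum(n * n for n in sample_sizes)  # sum of n_i squared
    return (total - sum_sq / total) / (a - 1)


def among_group_variance(ms_among, ms_within, sample_sizes):
    """Estimate the added variance component among groups."""
    return (ms_among - ms_within) / n_o(sample_sizes)


# Hypothetical sample sizes, used only to show how n_o behaves.
print(n_o([5, 5, 5, 5]))            # equal sizes: 5.0, the common sample size
print(round(n_o([3, 5, 5, 7]), 3))  # unequal sizes: 4.867, a bit below the mean of 5
```

With equal sample sizes, n_o is exactly the common sample size; it only drops below the arithmetic mean when the group sizes differ.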

Each component of the variance is often expressed as a percentage of the total variance components. Thus an anova table for a one-way anova would indicate the among-group variance component and the within-group variance component, and these numbers would add to 100%. In a more complicated anova, such as a nested anova, there would be several variance components, but they would still add up to 100%.

Here's an explanation that is not strictly correct (it obscures some of the mathematical details) but gives an intuitive idea of what it means to partition the variance. Consider a very simple anova, with just three observations in each of two categories:

A       B
10      4
12      5
8       3

First, calculate the mean for all six observations, which is 7. Subtracting 7 from each observation and squaring the difference gives you the squared deviates:

A                 B
(10-7)² = 9       (4-7)² = 9
(12-7)² = 25      (5-7)² = 4
(8-7)² = 1        (3-7)² = 16

The sum of these squared deviates is the total sum of squares. In this case, the squared deviates add up to 64. This number is a measure of how far the individual observations are from the overall mean.

Next, calculate the mean for A, which is 10, and the mean for B, which is 4. Subtract each group's mean from the observations in that group and square the differences:

A                 B
(10-10)² = 0      (4-4)² = 0
(12-10)² = 4      (5-4)² = 1
(8-10)² = 4       (3-4)² = 1

Notice that these squared deviates from the group means are, in general, smaller than the squared deviates from the overall mean. This makes sense; a member of a group is likely to be closer to its group's mean than it is to the mean of that group plus some other, different groups. Adding these squared deviates together gives us the within-group sum of squares, which in this case is 10. This is a measure of how far the individual observations are from their group means.

The difference between the total sum of squares and the within-group sum of squares is the among-group sum of squares. It is a measure of how much smaller the sum of squares gets when you use group means instead of the overall mean. When the group means are not very different from each other, they will all be close to the overall mean. In that case, the squared deviates from the group means will not be much smaller than the squared deviates from the overall mean, and the among-group sum of squares will be small. When the group means are very different from each other, the group means will be very different from the overall mean, the squared deviates from the group means will be a lot smaller, and the among-group sum of squares will be large.

The among-group sum of squares in this example is 64 minus 10, or 54, while the within-group sum of squares is 10. Expressed as a percentage of the total, the among-group variation represents 54/64 = 84.4% of the total; another way of saying this is that the groups "explain" 84.4% of the variation. The remaining 15.6% of the variation is within groups.
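The arithmetic for this small example can be checked with a short Python sketch. The variable names and the helper `sum_sq_dev` are mine, and the code follows the intuitive sums-of-squares partition described above, not the variance-component estimate.

```python
# The six observations from the example, in two groups.
groups = {"A": [10, 12, 8], "B": [4, 5, 3]}


def sum_sq_dev(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)


all_obs = [v for obs in groups.values() for v in obs]
grand_mean = sum(all_obs) / len(all_obs)

# Squared deviates from the overall mean.
ss_total = sum_sq_dev(all_obs, grand_mean)
# Squared deviates from each group's own mean.
ss_within = sum(sum_sq_dev(obs, sum(obs) / len(obs)) for obs in groups.values())
# What is left over is attributed to differences among the groups.
ss_among = ss_total - ss_within

print(ss_total, ss_within, ss_among)         # 64.0 10.0 54.0
print(round(100 * ss_among / ss_total, 1))   # 84.4
```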

Because the sums of squares are estimates of population parameters, converting them to estimates of the variance components is considerably more complicated; the actual estimate of the among-group component for this example is 87.3% of the total. But the basic idea, that a larger among-groups component indicates larger differences among the group means relative to the within-group variation, remains the same.
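To see where the 87.3% comes from, here is a sketch in Python that converts the sums of squares to mean squares and then to variance components, using the among-group formula given earlier. It assumes the usual one-way anova degrees of freedom (number of groups minus 1 among groups, total observations minus number of groups within groups); the variable names are mine.

```python
groups = [[10, 12, 8], [4, 5, 3]]

a = len(groups)                     # number of groups
sizes = [len(g) for g in groups]    # n_i for each group
n_total = sum(sizes)
grand_mean = sum(sum(g) for g in groups) / n_total
means = [sum(g) / len(g) for g in groups]

# Sums of squares among and within groups.
ss_among = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_among = ss_among / (a - 1)
ms_within = ss_within / (n_total - a)

# n_o equals the common sample size (3) here because the groups are equal in size.
n_o = (n_total - sum(n * n for n in sizes) / n_total) / (a - 1)

var_among = (ms_among - ms_within) / n_o
var_within = ms_within
pct_among = 100 * var_among / (var_among + var_within)

print(round(pct_among, 1))  # 87.3
```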

Although statisticians say that each level of an anova "explains" a proportion of the variation, this statistical jargon does not mean that you've found a biological cause-and-effect explanation. If you measure the number of ears of corn per stalk in 10 random locations in a field, analyze the data with a one-way anova, and say that the location "explains" 74.3% of the variation, you haven't really explained anything; you don't know whether some areas have higher yield because of different water content in the soil, different amounts of insect damage, different amounts of nutrients in the soil, or random attacks by a band of marauding corn bandits.

Partitioning the variance components is particularly useful in quantitative genetics, where the within-family component might reflect environmental variation while the among-family component reflects genetic variation. Of course, estimating heritability involves more than just doing a simple anova, but the basic concept is similar.

Another area where partitioning variance components is useful is in designing experiments. For example, let's say you're planning a big experiment to test the effect of different drugs on calcium uptake in rat kidney cells. You want to know how many rats to use, and how many measurements to make on each rat, so you do a pilot experiment in which you measure calcium uptake on 6 rats, with 4 measurements per rat. You analyze the data with a one-way anova and look at the variance components. If a high percentage of the variation is among rats, that would tell you that there's a lot of variation from one rat to the next, but the measurements within one rat are pretty uniform. You could then design your big experiment to include a lot of rats for each drug treatment, but not very many measurements on each rat. Or you could do some more pilot experiments to try to figure out why there's so much rat-to-rat variation (maybe the rats are different ages, or some have eaten more recently than others, or some have exercised more) and try to control it. On the other hand, if the among-rat portion of the variance was low, that would tell you that the mean values for different rats were all about the same, while there was a lot of variation among the measurements on each rat. You could design your big experiment with fewer rats and more observations per rat, or you could try to figure out why there's so much variation among measurements and control it better.

Partitioning the variance applies only to a model II one-way anova. It doesn't really tell you anything useful about a model I one-way anova, although sometimes people like to report it (because they're proud of how much of the variance their groups "explain," I guess).

Performing the analysis

The one-way anova spreadsheet, described on the anova significance page, calculates the within- and among-group components of variance and displays them as a percentage of the total.

Web pages

I don't know of any web pages that will calculate the variance components of an anova.

SAS

PROC GLM doesn't calculate the variance components for an anova. Instead, you use PROC VARCOMP. You set it up just like PROC GLM, with the addition of METHOD=TYPE1. The procedure has four different methods for estimating the variance components, and TYPE1 seems to be the same technique as the one I've described above. Here's how to do the one-way anova, including estimating the variance components, for the mussel shell example from the one-way anova page.

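/* One-way anova for the mussel shell data */
proc glm data=musselshells;
  class location;
  model aam = location;
run;

/* Estimate the variance components with the TYPE1 method */
proc varcomp data=musselshells method=type1;
  class location;
  model aam = location;
run;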

The results include the following:

Type 1 Estimates

Variance Component        Estimate

Var(location)            0.0001254
Var(Error)               0.0001587

The output is not given as a percentage of the total, so you'll have to calculate that. For these results, the among-group component is 0.0001254/(0.0001254+0.0001587)=0.4414, or 44.14%; the within-group component is 0.0001587/(0.0001254+0.0001587)=0.5586, or 55.86%.
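If you'd rather not do this by hand, a few lines of Python will convert the estimates to percentages. This is only a sketch: the dictionary is typed in from the estimates exactly as printed in the PROC VARCOMP output above, and the variable names are mine.

```python
# Estimates copied from the "Type 1 Estimates" table in the SAS output.
components = {"Var(location)": 0.0001254, "Var(Error)": 0.0001587}

total = sum(components.values())
percentages = {name: 100 * value / total for name, value in components.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

Because the printed estimates are rounded, the last digit of each percentage can differ slightly from one computed on the unrounded values.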