One of the assumptions of an anova and other parametric tests is that the within-group variances of the groups are all the same (exhibit homoscedasticity). If the variances are different from each other (exhibit heteroscedasticity), the probability of obtaining a "significant" result even though the null hypothesis is true may be greater than the desired alpha level.
To illustrate this problem, I've done simulations of samples from three populations, all with the same population mean. I simulated taking samples of 10 observations from population A, 7 from population B, and 3 from population C, and repeated this process thousands of times. When the three populations are homoscedastic (have the same standard deviation), the one-way anovas on the simulated data sets are significant (P<0.05) about 5 percent of the time, as they should be. However, when I make the standard deviations different (1.0 for population A, 2.0 for population B, and 3.0 for population C), I get a P-value less than 0.05 in about 18 percent of the simulations. In other words, even though the population means are really all the same, my chance of getting a false positive result is 18 percent, not the desired 5 percent.
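The simulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used to produce the numbers in the text: the function names, the seed, and the choice of 5000 repetitions are all assumptions. Because there are exactly three groups, the numerator of the F-ratio has 2 degrees of freedom, and in that special case the P-value has a simple closed form, so the sketch needs no statistics library. For other numbers of groups you would need a general F distribution, such as the one behind scipy.stats.f_oneway.

```python
import random


def anova_p_three_groups(groups):
    """P-value of a one-way anova on exactly three groups.

    With three groups the numerator has 2 d.f., so the upper tail of the
    F distribution has a closed form and no statistics library is needed.
    """
    if len(groups) != 3:
        raise ValueError("this shortcut only works for exactly three groups")
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_among = 0.0
    ss_within = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        ss_among += len(g) * (mean - grand_mean) ** 2
        ss_within += sum((x - mean) ** 2 for x in g)
    df_within = n_total - 3
    f_ratio = (ss_among / 2) / (ss_within / df_within)
    # Upper-tail probability of F with 2 and df_within degrees of freedom
    return (1 + 2 * f_ratio / df_within) ** (-df_within / 2)


def false_positive_rate(sizes, sds, n_sims=5000, alpha=0.05, seed=1):
    """Fraction of simulated data sets with P < alpha when all means are equal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Every group is drawn from a normal population with mean 0
        groups = [[rng.gauss(0.0, sd) for _ in range(n)]
                  for n, sd in zip(sizes, sds)]
        if anova_p_three_groups(groups) < alpha:
            hits += 1
    return hits / n_sims


equal_rate = false_positive_rate((10, 7, 3), (1.0, 1.0, 1.0))
unequal_rate = false_positive_rate((10, 7, 3), (1.0, 2.0, 3.0))
print(f"equal SDs:   {equal_rate:.3f}")
print(f"unequal SDs: {unequal_rate:.3f}")
```

If the sketch is right, the first rate should land near 0.05 and the second well above it, in line with the roughly 18 percent reported above; the exact figures depend on the seed and the number of repetitions.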
There have been a number of simulation studies that have tried to determine when heteroscedasticity is a big enough problem that other tests should be used. Early results suggested that heteroscedasticity was not a problem if all the sample sizes were equal (Glass et al. 1972), but later results found that large amounts of heteroscedasticity can inflate the false positive rate, even when the sample sizes are equal (Harwell et al. 1992). The problem is made worse when the sample sizes are unequal and the smaller samples are from populations with larger variances; but when the smaller samples are from populations with smaller variances, the false positive rate can actually be much less than 0.05, meaning the power of the test is reduced (Glass et al. 1972).
Despite all of the simulation studies that have been done, there does not seem to be a consensus about when heteroscedasticity is a big enough problem that alternatives to anova should be used. I have written a spreadsheet to simulate one-way anova with heteroscedasticity that may help you decide this for a one-way anova or Student's t-test. To use it, enter the sample sizes for your observed data, plus the standard deviation within each group. Then hit the option, command, and r keys simultaneously. This will run a macro that creates simulated data sets 1000 times. For each simulation, a set of observations is drawn at random for each group. Each group is normally distributed and has the population standard deviation that you've specified, and all the groups have the same population mean. A one-way anova is done on the simulated data (if you have only two groups, this is the same as Student's t-test), and the number of simulations that have a P-value less than 0.05 is recorded.
If the false positive rate in the simulations is near 0.05, then heteroscedasticity is probably not a problem for your data set. If the false positive rate is too high (defined as above 0.075, by Bradley's liberal criterion for robustness; Bradley 1978), you can try a data transformation; if that doesn't reduce the heteroscedasticity enough, you should use an alternative test that is less sensitive to heteroscedasticity.
Note that there is a problem with this simulation approach. Even if all the population standard deviations are the same, your sample standard deviations will be different from each other; with small sample sizes (fewer than 10), there can be quite a bit of variation among the sample standard deviations. Using your sample standard deviations as estimates of the population standard deviations in the simulations will therefore exaggerate the effects of heteroscedasticity. I don't know how to correct for this.
If the variances of your groups are very heterogeneous no matter what transformation you apply, there are a large number of alternative tests to choose from (Lix et al. 1996). The most commonly used is probably Welch's test, sometimes called Welch's t-test when there are two groups. Non-parametric tests, such as the Kruskal–Wallis test instead of a one-way anova, do not assume normality, but they do assume that the shapes of the distributions in different groups are the same, so they are not a good solution to the problem of heteroscedasticity.
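For two groups, the arithmetic behind Welch's t-test is short enough to sketch in Python. The version below is an illustration under stated assumptions rather than a full implementation: the function name `welch_t` and the data are made up for this example, and it returns only the test statistic and the Welch–Satterthwaite approximate degrees of freedom, leaving the P-value to a t-distribution table or a statistics package (scipy.stats.ttest_ind with equal_var=False performs the whole test).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, d.f.) for Welch's t-test on two samples.

    Unlike Student's t-test, the two sample variances are not pooled,
    so the groups are not assumed to be homoscedastic.
    """
    var_mean_a = variance(a) / len(a)   # squared standard error of mean a
    var_mean_b = variance(b) / len(b)   # squared standard error of mean b
    t = (mean(a) - mean(b)) / sqrt(var_mean_a + var_mean_b)
    # Welch-Satterthwaite approximation; usually not a whole number
    df = (var_mean_a + var_mean_b) ** 2 / (
        var_mean_a ** 2 / (len(a) - 1) + var_mean_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up numbers, only to show the call
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10, 12])
print(f"t = {t:.3f}, d.f. = {df:.2f}")
```

The approximate degrees of freedom shrink toward those of the smaller, more variable group, which is what makes the test less sensitive to heteroscedasticity than Student's t-test.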
All of the discussion above has been about one-way anovas. Homoscedasticity is also an assumption of other anovas, such as nested and two-way anovas, and regression and correlation. Much less work has been done on the effects of heteroscedasticity on these tests; all I can recommend is that you inspect the data for heteroscedasticity and hope that you don't find it.
The usual test for homogeneity of variances is Bartlett's test. This test is used when you have one measurement variable, one nominal variable, and you want to test the null hypothesis that the variances of the measurement variable are the same for the different groups. The basic idea is that the natural log of the variance is calculated for each group, then these are averaged. The variances are also averaged across groups. In both averages, each group is weighted by its degrees of freedom. The average of the natural logs of the individual variances is subtracted from the natural log of the average variance. After this difference is scaled by the total degrees of freedom and divided by a small correction factor, the resulting statistic is, under the null hypothesis of homogeneity of variances, approximately chi-square distributed with d.f. equal to the number of groups minus one.
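The calculation can be written out as a short Python function. This is a sketch of the standard textbook form of Bartlett's statistic, not the code behind any particular spreadsheet or statistics package; the name `bartlett_statistic` and the data are assumptions made for this example. The function returns the statistic and its degrees of freedom, and leaves the chi-square P-value to a table or a statistics library.

```python
from math import log
from statistics import variance


def bartlett_statistic(groups):
    """Return (statistic, d.f.) for Bartlett's test of homogeneity of variances."""
    k = len(groups)
    dfs = [len(g) - 1 for g in groups]          # d.f. within each group
    variances = [variance(g) for g in groups]   # sample variances
    df_total = sum(dfs)
    # Average variance, weighted by degrees of freedom (the pooled variance)
    pooled = sum(d * v for d, v in zip(dfs, variances)) / df_total
    # Log of the average variance minus the weighted average of the logs,
    # scaled by the total d.f.
    numerator = df_total * log(pooled) - sum(
        d * log(v) for d, v in zip(dfs, variances)
    )
    # Correction factor that improves the chi-square approximation
    correction = 1 + (sum(1 / d for d in dfs) - 1 / df_total) / (3 * (k - 1))
    return numerator / correction, k - 1


# Made-up data: three groups with visibly different spreads
stat, df = bartlett_statistic([[4.1, 5.0, 5.9, 5.2],
                               [3.0, 6.5, 4.8, 7.9, 2.2],
                               [5.1, 4.9, 5.0, 5.3]])
print(f"statistic = {stat:.2f} on {df} d.f.")
```

When all the sample variances are identical the statistic is zero, and it grows as the variances become more unequal.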
Bartlett's test is not a particularly good one, because it is sensitive to departures from normality as well as heteroscedasticity. It may be more helpful to use Bartlett's test to see what effect different transformations have on the heteroscedasticity, choosing the transformation with the highest (least significant) P-value, rather than taking the P-values too seriously; you shouldn't panic just because you have a significant Bartlett's test.
An alternative to Bartlett's test that I won't cover here is Levene's test. It is less sensitive to departures from normality, but if the data are approximately normal, it is less powerful than Bartlett's test.
For many measurement variables in biology, the coefficient of variation is the same for different groups. When this occurs, groups with larger means will also have larger variances. For example, if the petal lengths of red roses and pink roses both have a coefficient of variation of 20%, and red roses have a mean petal length 10% longer than pink roses, then the standard deviation of red petal length is 10% larger than for pink petal length. This means the variance is 21% larger for red roses. This kind of heteroscedasticity, in which the variance increases as the mean increases, can usually be greatly reduced with the right data transformation.
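The rose arithmetic, and the effect of a log transformation on it, can be checked with a small Python sketch. The petal lengths below are made-up numbers, and the red sample is built by simply scaling the pink one by 1.10 so that the two groups have exactly the same coefficient of variation; real data would only approximate this.

```python
from math import log
from statistics import mean, stdev, variance

# Hypothetical petal lengths (mm); red is pink scaled up by 10 percent,
# so both groups share the same coefficient of variation.
pink = [18.0, 22.5, 25.0, 27.5, 32.0]
red = [1.10 * x for x in pink]

cv_pink = stdev(pink) / mean(pink)
cv_red = stdev(red) / mean(red)
sd_ratio = stdev(red) / stdev(pink)           # 1.10: SD is 10% larger
var_ratio = variance(red) / variance(pink)    # 1.10 squared = 1.21

# After a log transformation the two standard deviations are equal,
# because log(1.10 * x) = log(1.10) + log(x) only shifts the mean.
log_sd_pink = stdev([log(x) for x in pink])
log_sd_red = stdev([log(x) for x in red])

print(f"CV: pink {cv_pink:.3f}, red {cv_red:.3f}")
print(f"SD ratio {sd_ratio:.2f}, variance ratio {var_ratio:.2f}")
print(f"SD of logs: pink {log_sd_pink:.4f}, red {log_sd_red:.4f}")
```

The log transformation turns a constant multiplicative difference between groups into a constant additive one, which is why it removes this kind of heteroscedasticity.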
While Bartlett's test is usually used when examining data to see if it's appropriate for a parametric test, there are times when testing the homogeneity of variances is the primary goal of an experiment. For example, let's say you want to know whether variation in stride length among runners is related to their level of experience—maybe as people run more, those who started with unusually long or short strides gradually converge on some ideal stride length. You could measure the stride length of non-runners, beginning runners, experienced amateur runners, and professional runners, with several individuals in each group, then use Bartlett's test to see whether there was significant heterogeneity in the variances.
How to do Bartlett's test
I have put together a spreadsheet that performs Bartlett's test for homogeneity of variances for up to 1000 observations in each of up to 50 groups. It allows you to see what the log or square-root transformation will do. It also shows a graph of the standard deviations plotted vs. the means. This gives you a quick visual display of the difference in amount of variation among the groups, and it also shows whether the mean and standard deviation are correlated.
When the mussel shell data from the one-way anova web page are entered into the spreadsheet, the P-values are 0.655 for untransformed data, 0.856 for square-root transformed data, and 0.929 for log-transformed data. None of these is close to significance, so there's no real need to worry. The graph of the untransformed data hints at a correlation between the mean and the standard deviation, so it might be a good idea to log-transform the data:
Standard deviation vs. mean AAM for untransformed and log-transformed data.
There is a web page for Bartlett's test that will handle up to 14 groups. You have to enter the variances and sample sizes, not the raw data.
You can use the HOVTEST=BARTLETT option in the MEANS statement of PROC GLM to perform Bartlett's test. This modification of the program from the one-way anova page does Bartlett's test.
proc glm data=musselshells;
  class location;
  model aam = location;
  means location / hovtest=bartlett;
run;
Sokal and Rohlf, pp. 398-399.
Zar, pp. 185, 202-204.
Bradley, J.V. 1978. Robustness? Brit. J. Math. Statis. Psychol. 31: 144-155.
Glass, G.V., P.D. Peckham, and J.R. Sanders. 1972. Consequences of failure to meet assumptions underlying fixed effects analyses of variance and covariance. Rev. Educ. Res. 42: 237-288.
Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. Summarizing Monte Carlo results in methodological research: the one- and two-factor fixed effects ANOVA cases. J. Educ. Stat. 17: 315-339.
Lix, L.M., J.C. Keselman, and H.J. Keselman. 1996. Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Rev. Educ. Res. 66: 579-619.
This page was last revised September 12, 2009. Its address is http://udel.edu/~mcdonald/stathomog.html. It may be cited as pp. 155-159 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.
©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.