Normality
[Figure: Histogram of dry weights of the amphipod crustacean Platorchestia platensis.]
One of the assumptions of an anova and other parametric tests is that the data are normally distributed. When you plot a frequency histogram, the frequencies should approximate the familiar bell-shaped normal curve. For example, the figure above is a histogram of dry weights of newly hatched amphipods (Platorchestia platensis). It fits the normal curve pretty well.
[Figure: Two non-normal histograms.]
Other data sets don't fit the normal curve very well. The histogram on the top is the level of sulphate in Maryland streams (data from the Maryland Biological Stream Survey). It is pulled away from the normal curve by a small number of streams with very high levels of sulphate. The histogram on the bottom is the number of egg masses laid by individuals of the lentago host race of the treehopper Enchenopa (unpublished data courtesy of Michael Cast). The distribution is bimodal, with one peak at around 14 egg masses and the other at zero.
Like other parametric tests, the analysis of variance assumes that the data fit the normal distribution. If your measurement variable is not normally distributed, you may be increasing your chance of a false positive result if you analyze the data with an anova or other test that assumes normality. Fortunately, an anova is not very sensitive to moderate deviations from normality, because when you take a large number of random samples from a population, the means of those samples are approximately normally distributed even when the population is not normal.
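The point about sample means can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not in the text: the population is exponential (strongly right-skewed), each sample has 30 observations, and the function names are made up. It compares the skewness of the raw population with the skewness of the sample means.

```python
import random


def moment_skewness(values):
    """Return the third standardized moment (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def simulate_sample_means(sample_size, n_samples, rng):
    """Draw n_samples samples from an exponential population; return their means."""
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed population: exponential, which has a long right tail.
population = [rng.expovariate(1.0) for _ in range(20000)]
means = simulate_sample_means(sample_size=30, n_samples=2000, rng=rng)

print(f"skewness of individual observations: {moment_skewness(population):.2f}")
print(f"skewness of means of 30 observations: {moment_skewness(means):.2f}")
```

The means should come out much closer to symmetric than the individual observations, which is why moderate non-normality does less damage to an anova than you might expect.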
It is possible to test the goodness-of-fit of a data set to the normal distribution. I do not suggest that you do this, because many data sets that are significantly non-normal would be perfectly appropriate for an anova.
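To see how a data set can be "significantly non-normal" without being badly shaped, here is a rough sketch. The text does not name a particular goodness-of-fit test; the Jarque-Bera test is used here purely as an assumption, because it can be written with the Python standard library alone. The data are simulated from a mildly right-skewed gamma distribution, and the P value relies on the large-sample chi-square approximation, so it should not be trusted for small data sets.

```python
import math
import random


def jarque_bera(values):
    """Return (JB statistic, approximate P value) for a Jarque-Bera normality test."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under normality JB is approximately chi-square with 2 degrees of freedom,
    # whose upper-tail probability has the closed form exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical data: 5000 observations with only a mild right skew.
mildly_skewed = [rng.gammavariate(20.0, 1.0) for _ in range(5000)]

jb, p = jarque_bera(mildly_skewed)
print(f"JB = {jb:.1f}, P = {p:.3g}")
```

With thousands of observations, even this mild skew gives a very small P value, although a histogram of the same data looks close to bell-shaped.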
Instead, if you have a large enough data set, I suggest you just look at the frequency histogram. If it looks more-or-less normal, go ahead and perform an anova. If it looks like a normal distribution that has been pushed to one side, like the sulphate data above, you should try different data transformations and see if any of them make the histogram look more normal. If that doesn't work, and the data still look severely non-normal, you should analyze them using a non-parametric test. Just about every parametric statistical test has a non-parametric substitute, such as the Kruskal–Wallis test instead of a one-way anova, the Wilcoxon signed-rank test instead of a paired t-test, and Spearman rank correlation instead of linear regression.
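One way to compare transformations by eye is to print crude text histograms side by side. The following is a minimal sketch using only the Python standard library; the data are simulated (hypothetical), the choice of log and square-root transformations and of 10 bins is an assumption, and `histogram_counts` and `print_histogram` are names made up for this example.

```python
import math
import random


def histogram_counts(values, n_bins=10):
    """Return (bin_edges, counts) for equal-width bins spanning the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def print_histogram(title, values, n_bins=10, max_bar=40):
    """Print a text histogram: lower bin edge, a bar of '#' marks, and the count."""
    edges, counts = histogram_counts(values, n_bins)
    scale = max_bar / max(counts)
    print(title)
    for left, count in zip(edges, counts):
        print(f"{left:10.2f} | {'#' * round(count * scale)} ({count})")
    print()


rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical right-skewed measurements (all positive, so logs are defined).
raw = [rng.lognormvariate(3.0, 0.8) for _ in range(200)]

print_histogram("Untransformed", raw)
print_histogram("Log-transformed", [math.log10(x) for x in raw])
print_histogram("Square-root transformed", [math.sqrt(x) for x in raw])
```

Whichever version looks most like a symmetric bell is the one to consider using for the anova; if none of them does, that points toward a non-parametric test.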
Many data sets will not be large enough for a histogram to be meaningful; if you have fewer than 10 or so observations, you can't really tell whether the data are normally distributed or not. In that case, you should use your judgement, based on the published literature, your prior experience, and your understanding of biology. For example, if you measure heart rate at random times during the day, you should know enough about the biology of heart rate to know that those numbers won't be normally distributed: the mode will be around the resting heart rate, with a few values much higher (due to exercise or stress) and no values much lower than average resting heart rate.
Skewness and kurtosis
[Figure: Graphs illustrating skewness and kurtosis.]
A histogram with a long tail on the right side, such as the sulphate data above, is said to be skewed to the right; a histogram with a long tail on the left side is said to be skewed to the left. There is a statistic to describe skewness, g1, but I don't know of any reason to calculate it; there is no rule of thumb that you shouldn't do an anova if g1 is greater than some cutoff value.
Another way in which data can deviate from the normal distribution is kurtosis. A histogram that has a high peak in the middle and long tails on either side is leptokurtic; a histogram with a broad, flat middle and short tails is platykurtic. The statistic to describe kurtosis is g2, but I can't think of any reason why you'd want to calculate it, either.
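For readers who are curious anyway, here is a short sketch of how g1 and g2 could be calculated. The text does not give the formulas, so this assumes the usual sample-corrected versions (the same ones spreadsheet SKEW and KURT functions use); the function names are made up for this example.

```python
import math


def _mean_and_sd(values):
    """Return the mean and the sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd


def g1_skewness(values):
    """Sample skewness: positive for a long right tail, negative for a long left tail."""
    n = len(values)
    if n < 3:
        raise ValueError("g1 needs at least 3 observations")
    mean, sd = _mean_and_sd(values)
    sum_cubed = sum((v - mean) ** 3 for v in values)
    return n * sum_cubed / ((n - 1) * (n - 2) * sd ** 3)


def g2_kurtosis(values):
    """Sample excess kurtosis: positive if leptokurtic, negative if platykurtic."""
    n = len(values)
    if n < 4:
        raise ValueError("g2 needs at least 4 observations")
    mean, sd = _mean_and_sd(values)
    sum_fourth = sum((v - mean) ** 4 for v in values)
    first = n * (n + 1) * sum_fourth / ((n - 1) * (n - 2) * (n - 3) * sd ** 4)
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - correction


# Hypothetical data: mostly small values plus one large one (long right tail).
right_skewed = [1, 1, 1, 2, 10]
print(f"g1 = {g1_skewness(right_skewed):.2f}")
print(f"g2 = {g2_kurtosis(right_skewed):.2f}")
```

A symmetric data set such as 1, 2, 3, 4, 5 gives g1 = 0, and a normal distribution gives values of both statistics near zero.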
How to look at normality
Spreadsheet
I've written a spreadsheet that will plot a frequency histogram for untransformed, log-transformed and square-root transformed data. It will handle up to 1000 observations.
Web pages
There are several web pages that will produce histograms, but most of them aren't very good. The interactive histogram web page is pretty cool. You enter your numbers (separated by spaces only, no tabs or line returns), and when you get a histogram, you can change the "binwidth" (the size of each interval on the histogram) by sliding a bar.
SAS
You can use the PLOTS option in PROC UNIVARIATE to get a stem-and-leaf display, which is a kind of very crude histogram. You can also use the HISTOGRAM option to get an actual histogram, but only if you know how to send the output to a graphics device driver.
Further reading
Sokal and Rohlf, pp. 698-703, 406-407.
Zar, pp. 185-188.
This page was last revised January 4, 2008. Its address is http://udel.edu/~mcdonald/statnormal.html.
©2007-2008 by John H. McDonald. You can probably do what you want with this content; see the permissions page at http://udel.edu/~mcdonald/statpermissions.html for details.


