Power analysis


When you are designing an experiment, it is a good idea to estimate the sample size you'll need. This is especially true if you're proposing to do something painful to humans or other vertebrates, where it is particularly important to minimize the number of individuals (without making the sample size so small that the whole experiment is a waste of time and suffering), or if you're planning a very time-consuming or expensive experiment. Methods have been developed for many statistical tests to estimate the sample size needed to detect a particular effect, or to estimate the size of the effect that can be detected with a particular sample size.

The problem with these techniques is that they require an accurate estimate of the size of the difference you are looking for (the "effect size"), and (for measurement variables) an accurate estimate of the standard deviation. When doing basic biological research, you usually don't know how big a difference you're looking for; if you knew how big the difference would be, you wouldn't need to do the experiment. In addition, even if you have enough preliminary data to estimate the standard deviation of a measurement variable, there's no guarantee it will be the same when you do a new experiment. You should become familiar with the concept of power analysis, but you should take the results with a large grain of salt.

Parameters

There are four or five numbers involved in a power analysis. The minimum effect size is the minimum difference you hope to detect. For example, if you are treating hens with something that you hope will change the sex ratio of their chicks, you might decide that the minimum change in the proportion of sexes that you're looking for is 10 percent. The minimum effect size is often just a nice round number of no particular meaning, which is one reason you should be skeptical of power analyses.

Alpha is the significance level of the test: the cutoff that the P-value must fall below for you to reject the null hypothesis. It is the probability of rejecting the null hypothesis even though it is true (a false positive). The usual value is alpha=0.05. Beta, in a power analysis, is the probability of failing to reject the null hypothesis, even though it is false (a false negative), when the real difference is equal to the minimum effect size. There is no firm rule about what value of beta to use; 50 percent or 20 percent or 10 percent seem fairly common. The power of a test is the probability of rejecting the null hypothesis when the real difference is equal to the minimum effect size, or 1−beta.

For measurement variables, you also need an estimate of the standard deviation. This can come from pilot experiments or from similar experiments in the published literature. Your standard deviation once you do the experiment is unlikely to be the same, which is another reason to be skeptical of power analyses. For nominal variables, the standard deviation is a simple function of the expected proportion and the sample size, so you don't need to estimate it separately.
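
To illustrate why no separate estimate is needed for nominal variables, here are the standard binomial formulas. The symbols are my own shorthand, not taken from any particular calculator: p is the expected proportion under the hypothesis being considered, n is the sample size, X is the observed count, and p-hat is the observed proportion.

```latex
\sigma_{X} = \sqrt{n\,p\,(1-p)}, \qquad
\sigma_{\hat{p}} = \sqrt{\frac{p\,(1-p)}{n}}
```

For example, with an expected proportion of 0.5 and a hypothetical sample of 100 chicks, the standard deviation of the observed proportion would be the square root of 0.25/100, or 0.05 (5 percentage points).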

On most web pages that do power analyses, you can either enter the desired power and estimate the sample size needed, or enter the sample size and estimate the power. If the effect size is really the minimum specified, and the standard deviation is as specified, the probability that this sample size will give a significant result (at the P<alpha level) is 1−beta, and the probability that it won't give a significant result is beta.

The equations for estimating sample size from alpha, beta, standard deviation, and minimum effect size can be quite complicated. Fortunately, there are online calculators for doing power analyses for many statistical tests. I'll try to put a link for power analysis on the web page for each statistical test.

How it works

The details of a power analysis are different for different statistical tests, but the basic concepts are similar; here I'll use the exact binomial test as an example. Imagine that you are studying wrist fractures, and your null hypothesis is that half the people who break one wrist break their right wrist, and half break their left. You decide that the minimum effect size is 10 percent; if the percentage of people who break their right wrist is 60 percent or more, or 40 percent or less, you want to have a significant result from the exact binomial test. Alpha is 5 percent, as usual. You want beta to be 10 percent, which means that if the percentage of broken right wrists is 40 percent or 60 percent, you want a sample size that will yield a significant (P<0.05) result 90 percent of the time, and a non-significant result (which would be a false negative in this case) only 10 percent of the time.

[Figure: binomial graphs for a sample size of 50]

The first graph shows the probability distribution under the null hypothesis, with a sample size of 50 individuals. In order to be significant at the P<0.05 level, the observed result would have to be less than 36 percent or more than 64 percent of people breaking their right wrists. As the second graph shows, if the true percentage is 40 percent, the sample data will be this extreme only 21 percent of the time. Obviously, a sample size of 50 is too small for this experiment; it would only yield a significant result 21 percent of the time, even if there's a 40:60 ratio of broken right wrists to left wrists.


[Figure: binomial graphs for a sample size of 270]

The next graph shows the probability distribution under the null hypothesis, with a sample size of 270 individuals. In order to be significant at the P<0.05 level, the observed result would have to be less than 43.7 percent or more than 56.3 percent of people breaking their right wrists. As the second graph shows, if the true percentage is 40 percent, the sample data will be this extreme 90 percent of the time. A sample size of 270 is pretty good for this experiment; it would yield a significant result 90 percent of the time if there's a 40:60 ratio of broken right wrists to left wrists.
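
The calculation behind the wrist example can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the method used to draw the graphs: the function names are made up for this sketch, and the two-tailed rejection region is built by allowing at most alpha/2 in each tail, which may differ slightly from other ways of computing a two-tailed exact binomial P-value.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def rejection_region(n, p0=0.5, alpha=0.05):
    """Counts that a two-tailed test rejects (assumption: at most alpha/2 per tail)."""
    pmf = [binom_pmf(k, n, p0) for k in range(n + 1)]
    region = set()
    tail = 0.0
    for k in range(n + 1):          # build the lower tail from 0 upward
        tail += pmf[k]
        if tail > alpha / 2:
            break
        region.add(k)
    tail = 0.0
    for k in range(n, -1, -1):      # build the upper tail from n downward
        tail += pmf[k]
        if tail > alpha / 2:
            break
        region.add(k)
    return region


def power(n, p_true, p0=0.5, alpha=0.05):
    """Probability of landing in the rejection region when the true proportion is p_true."""
    return sum(binom_pmf(k, n, p_true) for k in rejection_region(n, p0, alpha))


if __name__ == "__main__":
    for n in (50, 270):
        print(n, round(power(n, 0.4), 3))
```

Run with a true proportion of 0.4, this should show low power for a sample size of 50 and power of roughly 90 percent for 270. Small differences from the percentages quoted above are possible, depending on how the rejection region is defined.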

Examples

You plan to cross peas that are heterozygotes for Yellow/green, where Yellow is dominant. The expected ratio in the offspring is 3 Yellow: 1 green. You want to know whether yellow peas are actually more or less fit, which might show up as a different proportion of yellow peas than expected. You arbitrarily decide that you want a sample size that will detect a significant (P<0.05) difference if there are 3 percent more or fewer yellow peas than expected, with a beta (probability of false negatives) of 10 percent. You will test the data using the exact binomial test for goodness-of-fit if the sample size is small enough, or a G-test for goodness-of-fit if the sample size is larger. The power analysis is the same for both tests.

Go to the power analysis for proportions web page. Enter 75 percent for "Test value" (the expected proportion of yellow peas under the null hypothesis) and enter 72 percent for "Sample percentage" (3 percent fewer yellow peas). Enter 2.5 percent for "Alpha error level," because this particular web page is set up for one-tailed tests, and you're doing the more common two-tailed test. Enter 10 percent for "Beta error level," then click on "Calculate sample size." The result is 2253. That's a lot of peas! Note that, because the confidence intervals on a percentage are not symmetrical, the results are different if you enter 78 percent for "Sample percentage"; you should try it both ways and use the larger sample size result.
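
If you want to check a number like this without the web page, a common normal-approximation formula for a one-sample test of a proportion can be sketched as follows. I am assuming this formula, not quoting the web page's actual method, and the function name is made up for this example. Here alpha is the two-tailed significance level, so you enter 5 percent, not 2.5 percent.

```python
from math import sqrt
from statistics import NormalDist


def sample_size_one_proportion(p0, p1, alpha=0.05, beta=0.10):
    """Approximate sample size to detect true proportion p1 against null proportion p0.

    Uses the normal approximation to the binomial (an assumption of this sketch);
    alpha is two-tailed and beta is the false-negative rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the two-tailed test
    z_beta = NormalDist().inv_cdf(1 - beta)         # quantile corresponding to the desired power
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return (numerator / (p1 - p0)) ** 2


if __name__ == "__main__":
    # Pea example: 75 percent expected, 3 percent fewer or more yellow peas
    print(sample_size_one_proportion(0.75, 0.72))
    print(sample_size_one_proportion(0.75, 0.78))
```

For the pea example this approximation comes out at about 2253 for 72 percent and somewhat lower for 78 percent, which illustrates why you should try both directions and use the larger result.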


The example data for Student's t-test shows that the average height in the 2 p.m. section of Biological Data Analysis was 66.6 inches and the average height in the 5 p.m. section was 64.6 inches, but the difference is not significant (P=0.207). To figure out how much bigger the samples would have to be for this difference to be significant, go to the power analysis for t-tests web page. Enter 66.6 and 64.6 for the average values for samples 1 and 2. Using the STDEV function in Excel, calculate the standard deviation for each sample; it is 4.8 for sample 1 and 3.6 for sample 2. Enter 5 percent for alpha and 10 percent for beta. The result is 308, meaning that if 5 p.m. students really were two inches shorter than 2 p.m. students, you'd need 308 students in each class to detect a significant difference 90 percent of the time.

Further reading

Sokal and Rohlf, pp. 167-169.

Zar, p. 83.




This page was last revised January 1, 2008. Its address is http://udel.edu/~mcdonald/statsampsize.html.
©2007-2008 by John H. McDonald. You can probably do what you want with this content; see the permissions page at http://udel.edu/~mcdonald/statpermissions.html for details.