When you are designing an experiment, it is a good idea to estimate the sample size you'll need. This is especially true if you're proposing to do something painful to humans or other vertebrates, where it is particularly important to minimize the number of individuals (without making the sample size so small that the whole experiment is a waste of time and suffering), or if you're planning a very time-consuming or expensive experiment. Methods have been developed for many statistical tests to estimate the sample size needed to detect a particular effect, or to estimate the size of the effect that can be detected with a particular sample size.
In order to do a power analysis, you need to specify an effect size. This is the size of the difference between your null hypothesis and the alternative hypothesis that you hope to detect. For applied and clinical biological research, there may be a very definite effect size that you want to detect. For example, if you're testing a new dog shampoo, the marketing department at your company may tell you that producing the new shampoo would only be worthwhile if it made dogs' coats at least 25% shinier, on average. That would be your effect size, and you would use it in deciding how many dogs you would need to put through the canine reflectometer.
When doing basic biological research, you often don't know how big a difference you're looking for, and the temptation may be to just use the biggest sample size you can afford, or use a similar sample size to other research in your field. You should still do a power analysis before you do the experiment, just to get an idea of what kind of effects you could detect. For example, some anti-vaccination kooks have proposed that the U.S. government conduct a large study of unvaccinated and vaccinated children to see whether vaccines cause autism. It is not clear what effect size would be interesting: 10% more autism in one group? 50% more? twice as much? However, doing a power analysis shows that even if the study included every unvaccinated child in the United States aged 3 to 6, and an equal number of vaccinated children, there would have to be 25% more autism in one group in order to have a high chance of seeing a significant difference. A more plausible study, of 5,000 unvaccinated and 5,000 vaccinated children, would detect a significant difference with high power only if there were three times more autism in one group than the other. Because it is unlikely that there is such a big difference in autism between vaccinated and unvaccinated children, and because failing to find a relationship with such a study would not convince anti-vaccination kooks that there was no relationship (nothing would convince them there's no relationship—that's what makes them kooks), the power analysis tells you that such a study would not be worthwhile.
There are four or five numbers involved in a power analysis. The minimum effect size is the minimum deviation from the null hypothesis that you hope to detect. For example, if you are treating hens with something that you hope will change the sex ratio of their chicks, you might decide that the minimum change in the proportion of sexes that you're looking for is 10 percent. You might have a good economic reason for choosing the effect size; if not, you might want to see what kind of effects other people have found in similar experiments. If you don't have a particular effect size in mind, you might want to try different effect sizes and produce a graph of effect size vs. sample size.
Alpha is the significance level of the test (the critical P value), the probability of rejecting the null hypothesis even though it is true (a false positive). The usual value is alpha=0.05. Some power calculators use the one-tailed alpha, which is confusing, since the two-tailed alpha is much more common. Be sure you know which you're using.
Beta, in a power analysis, is the probability of accepting the null hypothesis, even though it is false (a false negative), when the real difference is equal to the minimum effect size. The power of a test is the probability of rejecting the null hypothesis when the real difference is equal to the minimum effect size, or 1−beta. There is no clear consensus on the value to use; a power of 80% (equivalent to a beta of 20%) is probably the most common, while powers of 50% and 90% are also sometimes used. The cost to you of a false negative should influence your choice of power; if you really, really want to be sure that you detect your effect size, you'll want to use a higher value for power (lower beta), which will result in a bigger sample size. Some power calculators ask you to enter beta, while others ask for power (1−beta); be very sure you understand which you need to use.
For measurement variables, you also need an estimate of the standard deviation. This can come from pilot experiments or from similar experiments in the published literature. The standard deviation you observe once you do the experiment is unlikely to be exactly the same as your estimate, so your experiment will actually be somewhat more or less powerful than you had predicted. For nominal variables, the standard deviation is a simple function of the sample size, so you don't need to estimate it separately.
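To see why no separate estimate is needed for a nominal variable with two values, note that its standard deviation follows directly from the proportion and the sample size. Here is a short Python sketch of that relationship (my own illustration, not part of the handbook; the function name is made up):

```python
import math

def binomial_sd(p, n):
    """Standard deviation of the count of 'successes' out of n trials,
    and of the observed proportion, when the true proportion is p.
    Illustration only; not one of the handbook's calculators."""
    sd_count = math.sqrt(n * p * (1 - p))
    sd_proportion = math.sqrt(p * (1 - p) / n)
    return sd_count, sd_proportion

# For a 50:50 sex ratio and 100 chicks, the count of females has
# standard deviation sqrt(100 * 0.5 * 0.5) = 5, so the observed
# proportion has standard deviation 0.05.
sd_count, sd_prop = binomial_sd(0.5, 100)
```

This is why power calculators for proportions ask only for the proportions and sample size, with no separate standard deviation box.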
On most web pages that do power analyses, you can either enter the desired power and estimate the sample size needed, or enter the sample size and estimate the power. If the effect size is really the minimum specified, and the standard deviation is as specified, the probability that this sample size will give a significant result (at the P<alpha level) is 1−beta, and the probability that it won't give a significant result is beta.
The equations for estimating sample size from alpha, beta, standard deviation, and minimum effect size can be quite complicated. Fortunately, there are online calculators for doing power analyses for many statistical tests. I'll try to put a link for power analysis on the web page for each statistical test.
How it works
The details of a power analysis are different for different statistical tests, but the basic concepts are similar; here I'll use the exact binomial test as an example. Imagine that you are studying wrist fractures, and your null hypothesis is that half the people who break one wrist break their right wrist, and half break their left. You decide that the minimum effect size is 10 percent; if the percentage of people who break their right wrist is 60 percent or more, or 40 percent or less, you want to have a significant result from the exact binomial test. Alpha is 5 percent, as usual. You want power to be 90 percent, which means that if the percentage of broken right wrists is 40 percent or 60 percent, you want a sample size that will yield a significant (P<0.05) result 90 percent of the time, and a non-significant result (which would be a false negative in this case) only 10 percent of the time.
The first graph shows the probability distribution under the null hypothesis, with a sample size of 50 individuals. In order to be significant at the P<0.05 level, the observed result would have to be less than 36 percent or more than 64 percent of people breaking their right wrists. As the second graph shows, if the true percentage is 40 percent, the sample data will be this extreme only 21 percent of the time. Obviously, a sample size of 50 is too small for this experiment; it would only yield a significant result 21 percent of the time, even if there's a 40:60 ratio of broken right wrists to left wrists.
The next graph shows the probability distribution under the null hypothesis, with a sample size of 270 individuals. In order to be significant at the P<0.05 level, the observed result would have to be less than 43.7 percent or more than 56.3 percent of people breaking their right wrists. As the graph below it shows, if the true percentage is 40 percent, the sample data will be this extreme 90 percent of the time. A sample size of 270 is pretty good for this experiment; it would yield a significant result 90 percent of the time if there's a 40:60 ratio of broken right wrists to left wrists.
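The calculations behind those graphs can be reproduced with a short script. The sketch below is my own illustration, not the handbook's calculator: it defines the rejection region of the exact binomial test by doubling the smaller tail probability, then sums the binomial distribution for the true proportion over that region. Other conventions for the two-tailed exact P value give slightly different rejection regions, and therefore slightly different powers.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes out of n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tailed_p(k, n, p0=0.5):
    """Two-tailed exact binomial P value for k successes out of n,
    computed by doubling the smaller tail (one common convention)."""
    lower = sum(binom_pmf(i, n, p0) for i in range(k + 1))
    upper = sum(binom_pmf(i, n, p0) for i in range(k, n + 1))
    return min(1.0, 2 * min(lower, upper))

def power(n, p_alt, p0=0.5, alpha=0.05):
    """Chance of a significant (P < alpha) result when the true
    proportion is p_alt rather than the null value p0."""
    return sum(binom_pmf(k, n, p_alt)
               for k in range(n + 1)
               if two_tailed_p(k, n, p0) < alpha)
```

Running `power(50, 0.40)` gives a value far below the desired 90 percent, while `power(270, 0.40)` comes out close to 0.90, matching the pattern in the graphs.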
Examples
You plan to cross peas that are heterozygotes for Yellow/green pea color, where Yellow is dominant. The expected ratio in the offspring is 3 Yellow: 1 green. You want to know whether yellow peas are actually more or less fit, which might show up as a different proportion of yellow peas than expected. You arbitrarily decide that you want a sample size that will detect a significant (P<0.05) difference if there are 3 percent more or fewer yellow peas than expected, with a power of 90 percent. You will test the data using the exact binomial test for goodness-of-fit if the sample size is small enough, or a G-test for goodness-of-fit if the sample size is larger. The power analysis is the same for both tests.
Go to the power calculator on the exact binomial test web page. Enter 0.75 for "Proportion if null hypothesis is true" and enter 0.72 for "Proportion if alternative hypothesis is true" (3 percent fewer yellow peas). Enter 0.05 for alpha and 0.90 for power, then click on "Run." The result is 2253. That's a lot of peas! Note that, because the confidence intervals on a percentage are not symmetrical, the results are different if you enter 0.78 for "Proportion if null hypothesis is true"; you should try it both ways and use the larger sample size result.
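If you would rather check the arithmetic than trust a web page, the standard normal-approximation formula for a one-proportion test comes very close to the exact calculator. This sketch is my own illustration (the function name is made up), and because it is an approximation it can differ slightly from the exact binomial answer:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_one_proportion(p0, p_alt, alpha=0.05, power=0.90):
    """Normal-approximation sample size for detecting a shift from
    proportion p0 to p_alt with a two-tailed test at the given alpha
    and power. Illustration only, not the handbook's exact calculator."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical z
    z_beta = NormalDist().inv_cdf(power)            # z for desired power
    numerator = (z_alpha * sqrt(p0 * (1 - p0))
                 + z_beta * sqrt(p_alt * (1 - p_alt)))
    return ceil((numerator / (p_alt - p0)) ** 2)

# 3 percent fewer yellow peas than the expected 75 percent:
n = sample_size_one_proportion(0.75, 0.72)
```

This returns 2254 peas, essentially the same as the 2253 from the exact binomial calculator. As the handbook notes, you should also run the calculation with the alternative proportion shifted the other way (0.78) and use the larger of the two sample sizes.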
The example data for Student's t-test show that the average height in the 2 p.m. section of Biological Data Analysis was 66.6 inches and the average height in the 5 p.m. section was 64.6 inches, but the difference is not significant (P=0.207). You want to know how many students you'd have to sample to have an 80 percent chance of a difference this large being significant. Go to the power calculator on the Student's t-test web page. Enter 2.0 for the difference in means. Using the STDEV function in Excel, calculate the standard deviation for each sample in the original data; it is 4.8 for sample 1 and 3.6 for sample 2. Enter 0.05 for alpha and 0.80 for power. The result is 72, meaning that if 5 p.m. students really were two inches shorter than 2 p.m. students, you'd need 72 students in each class to detect a significant difference 80 percent of the time.
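The same kind of back-of-the-envelope check works here. This sketch (my own illustration, with a made-up function name) uses the normal approximation and pools the two sample standard deviations; because it uses z values rather than the t distribution, it runs slightly low for modest sample sizes:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group_two_sample(sd1, sd2, diff, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for detecting a
    difference in means of size diff between two groups, pooling the
    two sample standard deviations. Illustration only; the handbook's
    calculator uses the t distribution and gives a slightly larger n."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled_var = (sd1**2 + sd2**2) / 2
    return ceil(2 * pooled_var * (z_alpha + z_beta)**2 / diff**2)

# standard deviations 4.8 and 3.6, minimum difference 2.0 inches:
n = n_per_group_two_sample(4.8, 3.6, 2.0)
```

This gives 71 students per class, a close match to the calculator's t-based answer of 72.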
How to do power analyses
I have put sample size calculators on the web pages in this handbook for some of the simpler tests, including the exact binomial test, Student's t-test, and paired t-test. Russ Lenth has put together a more extensive set of power analyses.
G*Power is an excellent free program, available for Mac and Windows, that will do power analyses for a large variety of tests. Calculating the effect size parameter can be the most difficult part of a power analysis, so one advantage of G*Power is that it allows you to calculate the effect size parameter using sample data; click on the "Determine" button next to the box labelled "effect size," fill in data that looks like what you want to detect, and it will calculate the effect size.
SAS has a PROC POWER that you can use for power analyses. You enter the needed parameters (which vary depending on the test) and enter a period (which symbolizes missing data in SAS) for the parameter you're solving for (usually ntotal, the total sample size, or npergroup, the number of samples in each group). I find that G*Power is easier to use than SAS for this purpose, so I don't recommend using SAS for your power analyses.
This page was last revised September 14, 2009. Its address is http://udel.edu/~mcdonald/statsampsize.html. It may be cited as pp. 33-38 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.
©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.