Any time you reject a null hypothesis because a P-value is less than your critical value, it's possible that you're wrong; the null hypothesis might really be true, and your significant result might be due to chance. A P-value of 0.05 means that, if the null hypothesis were true, there would be a 5 percent chance of getting a result at least as extreme as the one you observed. It does not mean that there's a 5 percent chance that the null hypothesis is true.
For example, if you do 200 statistical tests, and for all of them the null hypothesis is actually true, you'd expect 10 of the tests to be significant at the P<0.05 level, just due to chance. In that case, you'd have 10 statistically significant results, all of which were false positives. The cost, in time, effort and perhaps money, could be quite high if you based important conclusions on these false positives, and it would at least be embarrassing for you once other people did further research and found that you'd been mistaken.
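The arithmetic above can be checked with a small simulation. This is only a sketch: it assumes that when the null hypothesis is true, each P-value is uniformly distributed between 0 and 1 (which holds for continuous test statistics), and the function name, seed and number of repetitions are arbitrary choices made for illustration.

```python
import random

def mean_false_positives(n_tests=200, alpha=0.05, n_repeats=2000, seed=1):
    """Average number of 'significant' results when every null hypothesis is true.

    Assumption: under a true null, each P-value is uniform on [0, 1].
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_repeats):
        # Draw one P-value per test and count those below alpha.
        total += sum(1 for _ in range(n_tests) if rng.random() < alpha)
    return total / n_repeats

if __name__ == "__main__":
    # Should be close to 200 * 0.05 = 10 false positives per set of tests.
    print(mean_false_positives())
```

Any single set of 200 tests will vary around that average, so getting 6 or 14 false positives in one experiment would not be surprising.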
The problem that some fraction of multiple statistical tests will be false positives has received increasing attention in the last few years. This is important for such techniques as the use of microarrays, which make it possible to measure RNA quantities for tens of thousands of genes at once; brain scanning, in which blood flow can be estimated in 100,000 or more three-dimensional bits of brain; and evolutionary genomics, where the sequences of every gene in the genome of two or more species can be compared. There is no universally accepted approach for dealing with the problem of multiple comparisons; it is an area of active research, both in the mathematical details and broader epistemological questions.
Controlling the familywise error rate: Bonferroni correction
The classic approach to the multiple comparison problem is to control the familywise error rate. Instead of setting the critical P-level for significance, or alpha, to 0.05, a lower alpha is used. If the null hypothesis is true for all of the tests, the probability of getting at least one result that is significant at this new, lower alpha level is at most 0.05. In other words, if the null hypotheses are true, the probability that the family of tests includes one or more false positives due to chance is no more than 0.05.
The most common way to control the familywise error rate is with the Bonferroni correction. The significance level (alpha) for an individual test is found by dividing the familywise error rate (usually 0.05) by the number of tests. Thus if you are doing 100 statistical tests, the alpha level for an individual test would be 0.05/100=0.0005, and only individual tests with P<0.0005 would be considered significant.
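The calculation described above can be sketched in a few lines. The function names and the four example P-values are hypothetical, chosen only to illustrate the division.

```python
def bonferroni_alpha(familywise_rate, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return familywise_rate / n_tests

def bonferroni_significant(p_values, familywise_rate=0.05):
    """Return True for each P-value below the Bonferroni-corrected alpha."""
    alpha = bonferroni_alpha(familywise_rate, len(p_values))
    return [p < alpha for p in p_values]

if __name__ == "__main__":
    # 100 tests at a familywise rate of 0.05: about 0.0005 per test.
    print(bonferroni_alpha(0.05, 100))
    # Hypothetical P-values, for illustration only; corrected alpha is 0.0125.
    print(bonferroni_significant([0.001, 0.02, 0.04, 0.30]))
    # [True, False, False, False]
```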
The Bonferroni correction is appropriate when a single false positive in a set of tests would be a problem. For example, let's say you've developed a new chicken feed, MiracleChick™, and you're comparing it to traditional chicken feed. You give some chickens the traditional feed and some other chickens the MiracleChick, then you compare the following between the two groups of chickens: food consumption, growth rate, egg production, egg size, feces production, phosphorus content of feces, nitrogen content of feces, meat/bone ratio, white meat/dark meat ratio, and general prettiness. If you see a significant improvement in any of these quantities, you'll start marketing the MiracleChick on that basis, but if your significant result turns out to be a false positive, the farmers are going to sue you. You've got ten statistical tests, so there's a good chance that at least one will be significant at the 0.05 level, even if MiracleChick is exactly the same as traditional food. Using the Bonferroni correction, you'd require the P-value to be less than 0.005, which would reduce your chance of a false positive (and the resulting angry farmers).
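To put a rough number on that "good chance," here is a minimal sketch of the familywise error rate for the ten chicken tests. It assumes the ten tests are independent and that every null hypothesis is true; measurements taken on the same chickens are probably correlated, so treat the figures as approximate. The function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n_tests tests.

    Assumes the tests are independent, every null hypothesis is true,
    and each test is done at significance level alpha.
    """
    return 1 - (1 - alpha) ** n_tests

if __name__ == "__main__":
    n = 10  # the ten chicken measurements
    print(round(familywise_error_rate(0.05, n), 3))      # uncorrected: 0.401
    print(round(familywise_error_rate(0.05 / n, n), 3))  # Bonferroni: 0.049
```

Under these assumptions, the uncorrected tests give about a 40 percent chance of at least one false positive, while the Bonferroni-corrected tests keep it just under 5 percent.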
The Bonferroni correction is most accurate when the tests are independent of each other, as when you are comparing sample A with sample B, C with D, E with F, etc. If you are comparing sample A with sample B, A with C, A with D, etc., the comparisons are not independent; the Bonferroni correction still keeps the familywise error rate at or below the chosen level, but it becomes more conservative than it needs to be. This occurs when doing unplanned comparisons of means in anova, for which a variety of other techniques have been developed.
While the Bonferroni correction does a good job of controlling the familywise error rate for multiple, independent comparisons, it may lead to a very high rate of false negatives. For example, if you are comparing the expression levels of 100,000 genes between two kinds of cells, using the Bonferroni correction would mean that a t-test for an individual gene would have to have P<0.0000005 to be considered significant. That could mean that only genes with gigantic differences in expression level would be significant; there might be a lot of genes with real, moderate-sized differences that would be overlooked, all because you wanted to be sure that your results did not include a single false positive.
An important issue with the Bonferroni correction is deciding what a "family" of statistical tests is. If you're testing 12 new chicken feeds, and you measure 10 different quantities on the chickens, is each set of 10 tests for a single chicken feed one "family," so your critical P-value is 0.05/10? Or is the whole set of 10 tests on 12 feeds one family, so your critical P-value is 0.05/120? And what if three months later, you test 5 more chicken feeds--now do you go back and test everything against 0.05/170? There is no firm rule on this; you'll have to use your judgement, based on just how bad a false positive would be. Obviously, you should make this decision before you look at the results; otherwise it would be too easy to unconsciously rationalize a family size that gives you the results you want.
Controlling the false discovery rate: Benjamini–Hochberg procedure
An alternative approach is to control the false discovery rate. This is the proportion of "discoveries" (significant results) that are actually false positives. For example, let's say you're using microarrays to compare expression levels for 100,000 genes between liver tumors and normal liver cells. You're going to do additional experiments on any genes that show a significant difference between the normal and tumor cells, and you're willing to accept up to 10 percent of the genes with significant results being false positives; you'll find out they're false positives when you do the follow-up experiments. In this case, you would set your false discovery rate to 10 percent.
One good technique for controlling the false discovery rate was briefly mentioned by Simes (1986) and developed in detail by Benjamini and Hochberg (1995). Put the individual P-values in order, from smallest to largest. The smallest P-value has a rank of i=1, the next has i=2, etc. Then compare each individual P-value to (i/m)Q, where m is the total number of tests and Q is the chosen false discovery rate. The largest P-value that has P<(i/m)Q is significant, and all P-values smaller than it are also significant.
To illustrate this, here are some data on genotype frequencies in the oyster Crassostrea virginica. McDonald et al. (1996) compared the genotype frequencies of 6 polymorphisms to the frequencies expected under Hardy-Weinberg equilibrium, using goodness-of-fit tests. There were two population samples, so there were a total of twelve P-values, shown here ordered from smallest to largest. The value of (i/m)Q is shown for a false discovery rate of Q=0.20.
Gene    Location   i    P-value   (i/m)Q
CV7.7   FL         1    0.010     0.017
CVJ5    FL         2    0.032     0.033
CVL1    SC         3    0.07      0.050
CVB2m   SC         4    0.07      0.067
CVB1    FL         5    0.20      0.083
CV7.7   SC         6    0.38      0.100
CVB2e   FL         7    0.48      0.117
CVB2m   FL         8    0.49      0.133
CVB2e   SC         9    0.60      0.150
CVB1    SC         10   0.68      0.167
CVJ5    SC         11   0.74      0.183
CVL1    FL         12   0.97      0.200
Reading down the column of P-values, the largest one with P<(i/m)Q is the second one, CVJ5 in Florida, where the individual P-value (0.032) is less than the (i/m)Q value of 0.033. Thus the first two tests would be significant. If you used a Bonferroni correction and set the familywise error rate to 0.05, then each individual P-value would be compared to 0.05/12=0.0042, and none would have been significant.
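The procedure and the oyster example can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function and variable names are hypothetical, and it uses the strict comparison P<(i/m)Q as written above, whereas many software packages use "less than or equal to."

```python
def benjamini_hochberg(p_values, q):
    """Flag the tests that are significant under the Benjamini-Hochberg procedure.

    Uses the strict comparison P < (i/m)Q, as in the text.
    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda k: p_values[k])
    largest_rank = 0
    for rank, k in enumerate(order, start=1):
        if p_values[k] < (rank / m) * q:
            largest_rank = rank  # keep the largest rank that passes
    significant = [False] * m
    # Every test ranked at or below the largest passing rank is significant.
    for k in order[:largest_rank]:
        significant[k] = True
    return significant

# P-values from the oyster table above, in rank order.
OYSTER_P = [0.010, 0.032, 0.07, 0.07, 0.20, 0.38,
            0.48, 0.49, 0.60, 0.68, 0.74, 0.97]

if __name__ == "__main__":
    bh = benjamini_hochberg(OYSTER_P, q=0.20)
    bonferroni = [p < 0.05 / len(OYSTER_P) for p in OYSTER_P]
    # Number of significant tests: Benjamini-Hochberg, then Bonferroni.
    print(sum(bh), sum(bonferroni))  # 2 0
```

Note that a P-value can be significant even if it is larger than its own (i/m)Q, as long as some P-value ranked after it passes the comparison; that is why the code tracks the largest passing rank instead of stopping at the first failure.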
Other, more complicated techniques for controlling the false discovery rate, such as those described by Reiner et al. (2003), may be more appropriate when there is a lack of independence in the data. If you're using microarrays, in particular, you need to become familiar with this topic.
When not to correct for multiple comparisons
The goal of multiple comparisons corrections is to reduce the number of false positives. An inevitable byproduct of this is that you increase the number of false negatives, where there really is an effect but you don't detect it as statistically significant. If false negatives are very costly, you may not want to correct for multiple comparisons. For example, let's say you've gone to a lot of trouble and expense to knock out your favorite gene, mannose-6-phosphate isomerase (MPI), in a strain of mice that spontaneously develop lots of tumors. Hands trembling with excitement, you get the first MPI-/- mice and start measuring things: blood pressure, growth rate, maze-learning speed, bone density, general prettiness, everything you can think of to measure on a mouse. You measure 50 things on MPI-/- mice and normal mice, run tests, and the smallest P-value is 0.013 for a difference in tumor size. If you use either a Bonferroni correction or the Benjamini and Hochberg procedure, that P=0.013 won't be close to significant. Should you conclude that there's no significant difference between the MPI-/- and MPI+/+ mice, write a boring little paper titled "Lack of anything interesting in MPI-/- mice," and look for another project? No, your paper should be "Possible effect of MPI on cancer." You should be suitably cautious, of course, but the cost of a false positive--if further experiments show that MPI really has no effect on tumors--is just a few more experiments. The cost of a false negative, on the other hand, could be that you've missed out on a hugely important discovery.
Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B. 57: 289-300.
McDonald, J.H., B.C. Verrelli and L.B. Geyer. 1996. Lack of geographic variation in anonymous nuclear polymorphisms in the American oyster, Crassostrea virginica. Mol. Biol. Evol. 13: 1114-1118.
Reiner, A., D. Yekutieli and Y. Benjamini. 2003. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19: 368-375.
Simes, R.J. 1986. An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751-754.
This page was last revised October 10, 2008. Its address is http://udel.edu/~mcdonald/statmultcomp.html. It may be cited as pp. 256-259 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.
©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.