Kruskal–Wallis test and Mann–Whitney U test
When to use them
The Kruskal–Wallis test is most commonly used when there is one nominal variable and one measurement variable, and the measurement variable does not meet the normality assumption of an anova. It is the non-parametric analogue of a one-way anova. A one-way anova may yield inaccurate estimates of the P-value when the data are very far from normally distributed. The Kruskal–Wallis test does not make assumptions about normality. Like most non-parametric tests, it is performed on ranked data, so the measurement observations are converted to their ranks in the overall data set: the smallest value gets a rank of 1, the next smallest gets a rank of 2, and so on. The loss of information involved in substituting ranks for the original values can make this a less powerful test than an anova, so the anova should be used if the data meet the assumptions.
If the original data set actually consists of one nominal variable and one ranked variable, you cannot do an anova and must use the Kruskal–Wallis test.
The Mann–Whitney U-test (also known as the Mann–Whitney–Wilcoxon test, the Wilcoxon rank-sum test, or the Wilcoxon two-sample test) is limited to nominal variables with only two values; it is the non-parametric analogue to Student's t-test. It uses a different test statistic (U instead of the H of the Kruskal–Wallis test), but the P-value is mathematically identical to that of a Kruskal–Wallis test. For simplicity, I will only refer to Kruskal–Wallis in the rest of this web page, but everything also applies to the Mann–Whitney U-test.
Null hypothesis
The null hypothesis is that the samples come from populations with identical "locations." Another way of stating this is that the mean ranks of samples from the populations are expected to be the same.
The Kruskal–Wallis test does not test the null hypothesis that the populations have identical means, which is the null hypothesis of a one-way anova. It is therefore incorrect to say something like "The mean amount of substance X was significantly higher in muscle tissue than in liver (Kruskal–Wallis test, P=0.012)." It also does not test the null hypothesis that the populations have equal medians, although you will see this error in many places, including some statistics textbooks. To illustrate this point, I made up three sets of numbers with identical means (43.5), identical medians (27.5), and a significant (P=0.025) Kruskal–Wallis test:
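| Group 1 | Group 2 | Group 3 |
|---|---|---|
| 1 | 10 | 19 |
| 2 | 11 | 20 |
| 3 | 12 | 21 |
| 4 | 13 | 22 |
| 5 | 14 | 23 |
| 6 | 15 | 24 |
| 7 | 16 | 25 |
| 8 | 17 | 26 |
| 9 | 18 | 27 |
| 46 | 37 | 28 |
| 47 | 58 | 65 |
| 48 | 59 | 66 |
| 49 | 60 | 67 |
| 50 | 61 | 68 |
| 51 | 62 | 69 |
| 52 | 63 | 70 |
| 53 | 64 | 71 |
| 342 | 193 | 72 |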
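The claim about these three made-up groups can be checked by hand. The following is a minimal Python sketch (Python is not used elsewhere on this page, and the variable names are purely illustrative) that computes the means, medians, rank sums, H, and the P-value. It assumes no ties, which holds for these 54 values, and uses the closed form of the chi-square tail probability for 2 degrees of freedom.

```python
from math import exp
from statistics import mean, median

# The three made-up groups from the table above (18 values each).
groups = [
    list(range(1, 10)) + list(range(46, 54)) + [342],          # Group 1
    list(range(10, 19)) + [37] + list(range(58, 65)) + [193],  # Group 2
    list(range(19, 28)) + [28] + list(range(65, 72)) + [72],   # Group 3
]

means = [mean(g) for g in groups]
medians = [median(g) for g in groups]

# All 54 values are distinct, so a value's rank is simply its 1-based
# position in the pooled, sorted data; no tie correction is needed.
pooled = sorted(v for g in groups for v in g)
rank = {v: i + 1 for i, v in enumerate(pooled)}
rank_sums = [sum(rank[v] for v in g) for g in groups]

# Kruskal-Wallis H statistic (version without tie correction).
n_total = len(pooled)
h = 12 / (n_total * (n_total + 1)) * sum(
    r * r / len(g) for r, g in zip(rank_sums, groups)
) - 3 * (n_total + 1)

# Three groups give 2 degrees of freedom, for which the chi-square
# upper-tail probability has the closed form exp(-H / 2).
p_value = exp(-h / 2)

print("means:", means)
print("medians:", medians)
print("rank sums:", rank_sums)
print("H = %.3f, P = %.3f" % (h, p_value))
```

The means and medians agree across groups, yet the rank sums differ, which is what the test responds to.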
Assumptions
The Kruskal–Wallis test does NOT assume that the data are normally distributed; that is its big advantage. It DOES, however, assume that the observations in each group come from populations with the same shape of distribution, so if different groups have dramatically different shapes (one is skewed to the right and another is skewed to the left, for example), the Kruskal–Wallis test may give inaccurate results. I don't know what to suggest in that situation; maybe you could look into some kind of bootstrap analysis.
Heteroscedasticity is one way in which different groups can have different shaped distributions. If the distributions are normally shaped but highly heteroscedastic, you can use Welch's t-test or Welch's anova. If the distributions are both non-normal and highly heteroscedastic, I don't know what to recommend.
How the test works
When working with a measurement variable, the Kruskal–Wallis test starts by substituting the rank in the overall data set for each measurement value. The smallest value gets a rank of 1, the second-smallest gets a rank of 2, etc. Tied observations get average ranks; thus if there were four identical values occupying the fifth, sixth, seventh and eighth smallest places, all would get a rank of 6.5.
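The ranking step, including the averaging of tied ranks, might be sketched as follows. The function name `average_ranks` and the sample values are invented for illustration; the data are chosen so that four identical values occupy the fifth through eighth smallest places.

```python
def average_ranks(values):
    """Return the 1-based rank of each value; tied values share the
    average of the ranks they jointly occupy."""
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` to cover the whole run of values equal to this one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) hold ranks start+1..end+1.
        shared_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


data = [3, 8, 12, 15, 20, 20, 20, 20, 31]
print(average_ranks(data))
```

The four tied values each receive 6.5, the average of ranks 5, 6, 7 and 8.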
The sum of the ranks is calculated for each group, then the test statistic, H, is calculated. H is given by a rather formidable formula that basically represents the variance of the ranks among groups, with an adjustment for the number of ties. H is approximately chi-square distributed, meaning that the probability of getting a particular value of H by chance, if the null hypothesis is true, is the P value corresponding to a chi-square equal to H; the degrees of freedom is the number of groups minus 1.
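For reference, the formula is usually written in the following textbook form. The symbols are notation introduced here rather than taken from this page: N is the total number of observations, k the number of groups, n_i the number of observations in group i, R_i the sum of the ranks in group i, and t_j the number of observations sharing the j-th tied value.

```latex
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^{2}}{n_i} - 3(N+1),
\qquad
H_{\text{corrected}} = \frac{H}{1 - \dfrac{\sum_{j} \left( t_j^{3} - t_j \right)}{N^{3} - N}}
```

When there are no ties, every t_j equals 1, the correction factor is 1, and the two versions coincide.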
If the sample sizes are too small, H does not follow a chi-square distribution very well, and the results of the test should be used with caution. Fewer than 5 observations in each group seems to be the accepted definition of "too small."
A significant Kruskal–Wallis test may be followed up by unplanned comparisons of mean ranks, analogous to the Tukey-Kramer method for comparing means. There is an online calculator for computing the Least Significant Difference in ranks.
Examples
Bolek and Coggins (2003) collected multiple individuals of the toad Bufo americanus, the frog Rana pipiens, and the salamander Ambystoma laterale from a small area of Wisconsin. They dissected the amphibians and counted the number of parasitic helminth worms in each individual. There is one measurement variable (worms per individual amphibian) and one nominal variable (species of amphibian), and the authors did not think the data fit the assumptions of an anova. The results of a Kruskal–Wallis test were significant (H=63.48, 2 d.f., P=1.6×10⁻¹⁴); the mean ranks of worms per individual are significantly different among the three species.
McDonald et al. (1996) examined geographic variation in anonymous DNA polymorphisms (variation in random bits of DNA of no known function) in the American oyster, Crassostrea virginica. They used an estimator of Wright's FST as a measure of geographic variation. They compared the FST values of the six DNA polymorphisms to FST values on 14 proteins from Buroker (1983). The biological question was whether protein polymorphisms would have generally lower or higher FST values than anonymous DNA polymorphisms; if so, it would suggest that natural selection could be affecting the protein polymorphisms. FST has a theoretical distribution that is highly skewed, so the data were analyzed with a Mann–Whitney U-test.
| gene | class | FST |
|---|---|---|
| CVB1 | DNA | -0.005 |
| CVB2m | DNA | 0.116 |
| CVJ5 | DNA | -0.006 |
| CVJ6 | DNA | 0.095 |
| CVL1 | DNA | 0.053 |
| CVL3 | DNA | 0.003 |
| 6Pgd | protein | -0.005 |
| Aat-2 | protein | 0.016 |
| Acp-3 | protein | 0.041 |
| Adk-1 | protein | 0.016 |
| Ap-1 | protein | 0.066 |
| Est-1 | protein | 0.163 |
| Est-3 | protein | 0.004 |
| Lap-1 | protein | 0.049 |
| Lap-2 | protein | 0.006 |
| Mpi-2 | protein | 0.058 |
| Pgi | protein | -0.002 |
| Pgm-1 | protein | 0.015 |
| Pgm-2 | protein | 0.044 |
| Sdh | protein | 0.024 |
The results were not significant (U=39.5, corresponding to z=0.21 under the normal approximation; P=0.84), so the null hypothesis that the FST values of DNA and protein polymorphisms have the same mean ranks is not rejected.
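As a rough cross-check, the Mann–Whitney statistic for the oyster table can be computed directly. This Python sketch uses invented variable names and assumes the P-value comes from the normal approximation with a tie-corrected variance and no continuity correction; other software may use an exact test or a continuity correction and give slightly different numbers.

```python
from math import erfc, sqrt

# FST values from the table above.
dna = [-0.005, 0.116, -0.006, 0.095, 0.053, 0.003]
protein = [-0.005, 0.016, 0.041, 0.016, 0.066, 0.163, 0.004,
           0.049, 0.006, 0.058, -0.002, 0.015, 0.044, 0.024]

pooled = sorted(dna + protein)
n1, n2, n = len(dna), len(protein), len(pooled)


def avg_rank(value):
    """Average 1-based rank of `value` in the pooled sample (handles ties)."""
    first = pooled.index(value) + 1   # lowest rank occupied by this value
    count = pooled.count(value)       # number of observations tied at it
    return first + (count - 1) / 2


rank_sum_dna = sum(avg_rank(v) for v in dna)

# U for each group; the smaller of the two is conventionally reported.
u_dna = rank_sum_dna - n1 * (n1 + 1) / 2
u_protein = n1 * n2 - u_dna
u = min(u_dna, u_protein)

# Normal approximation: tie-corrected variance, no continuity correction.
tie_term = sum(c ** 3 - c for c in (pooled.count(v) for v in set(pooled)))
variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
z = (u_dna - n1 * n2 / 2) / sqrt(variance)
p_value = erfc(abs(z) / sqrt(2))  # two-sided P-value

print("rank sum (DNA) =", rank_sum_dna)
print("U =", u)
print("z = %.3f, z squared = %.4f, P = %.4f" % (z, z * z, p_value))
```

Under these assumptions the absolute value of z is about 0.21, and z squared equals the Kruskal–Wallis H for the same data, which is why the two tests give the same P-value for two groups.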
Graphing the results
It is tricky to know how to visually display the results of a Kruskal–Wallis test. It would be misleading to plot the means or medians on a bar graph, as the Kruskal–Wallis test is not a test of the difference in means or medians. If there are a relatively small number of observations, you could put the individual observations on a bar graph, with the value of the measurement variable on the Y axis and its rank on the X axis, and use a different pattern for each value of the nominal variable. Here's an example using the oyster Fst data:
[Figure: Fst values for DNA and protein polymorphisms in the American oyster. DNA polymorphisms are shown in red.]
If there are larger numbers of observations, you could plot a histogram for each category, all with the same scale, and align them vertically. I don't have suitable data for this handy, so here's an illustration with imaginary data:
[Figure: Histograms of three sets of numbers.]
Similar tests
One-way anova is more powerful and easier to understand than the Kruskal–Wallis test, so it should be used unless the data are severely non-normal. There is no firm rule about how non-normal data can be before an anova becomes inappropriate.
If the data are normally distributed but heteroscedastic, you can use Welch's t-test (for two groups) or Welch's anova (for more than two groups).
Power analysis
I am not aware of a technique for estimating the sample size needed for a Kruskal–Wallis test.
How to do the test
Spreadsheet
I have put together a spreadsheet to do the Kruskal–Wallis test on up to 20 groups, with up to 1000 observations per group.
Web pages
Richard Lowry has web pages for performing the Kruskal–Wallis test for two groups, three groups, or four groups.
SAS
To do a Kruskal–Wallis test in SAS, use the NPAR1WAY procedure (that's the numeral "one," not the letter "el," in NPAR1WAY). "Wilcoxon" tells the procedure to only do the Kruskal–Wallis test; if you leave that out, you'll get several other statistical tests as well, tempting you to pick the one whose results you like the best. The nominal variable that gives the group names is given with the "class" parameter, while the measurement or rank variable is given with the "var" parameter. Here's an example, using the oyster data from above:
    data oysters;
       input markername $ markertype $ fst;
       cards;
    CVB1    DNA      -0.005
    CVB2m   DNA       0.116
    CVJ5    DNA      -0.006
    CVJ6    DNA       0.095
    CVL1    DNA       0.053
    CVL3    DNA       0.003
    6Pgd    protein  -0.005
    Aat-2   protein   0.016
    Acp-3   protein   0.041
    Adk-1   protein   0.016
    Ap-1    protein   0.066
    Est-1   protein   0.163
    Est-3   protein   0.004
    Lap-1   protein   0.049
    Lap-2   protein   0.006
    Mpi-2   protein   0.058
    Pgi     protein  -0.002
    Pgm-1   protein   0.015
    Pgm-2   protein   0.044
    Sdh     protein   0.024
    ;
    proc npar1way data=oysters wilcoxon;
       class markertype;
       var fst;
       run;
The output contains a table of "Wilcoxon scores"; the "mean score" is the mean rank in each group, which is what you're testing the homogeneity of. "Chi-square" is the H-statistic of the Kruskal–Wallis test, which is approximately chi-square distributed. The "Pr > Chi-Square" is your P-value. You would report these results as "H=0.04, 1 d.f., P=0.84."
    Wilcoxon Scores (Rank Sums) for Variable fst
         Classified by Variable markertype

                      Sum of    Expected     Std Dev        Mean
    markertype   N    Scores    Under H0    Under H0       Score
    ------------------------------------------------------------
    DNA          6     60.50        63.0   12.115236   10.083333
    protein     14    149.50       147.0   12.115236   10.678571

              Kruskal–Wallis Test
        Chi-Square           0.0426
        DF                        1
        Pr > Chi-Square      0.8365
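If SAS is not available, the same figures can be approximated with a short script. The following is a minimal sketch in Python (swapped in here for illustration; it is not part of the original page) of the tie-corrected H statistic and its chi-square P-value, using only the standard library. The function names `chi2_sf` and `kruskal_wallis` are made up for this example, and `chi2_sf` handles only whole-number degrees of freedom, which is all this test needs.

```python
from collections import Counter
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    term, total = sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(sqrt(half)) + sqrt(2 / pi) * exp(-half) * total


def kruskal_wallis(groups):
    """Return (H, df, P) for a list of groups, with tie correction."""
    counts = Counter(v for g in groups for v in g)
    n = sum(counts.values())
    # Average rank for each distinct value in the pooled data.
    rank, position = {}, 0
    for value in sorted(counts):
        c = counts[value]
        rank[value] = position + (c + 1) / 2
        position += c
    h = 12 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    # Divide by the tie correction factor.
    h /= 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    df = len(groups) - 1
    return h, df, chi2_sf(h, df)


dna = [-0.005, 0.116, -0.006, 0.095, 0.053, 0.003]
protein = [-0.005, 0.016, 0.041, 0.016, 0.066, 0.163, 0.004,
           0.049, 0.006, 0.058, -0.002, 0.015, 0.044, 0.024]

h, df, p = kruskal_wallis([dna, protein])
print("H = %.4f, %d d.f., P = %.4f" % (h, df, p))
```

The printed values should agree with the "Chi-Square", "DF" and "Pr > Chi-Square" lines of the SAS output to the precision shown there.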
Further reading
Sokal and Rohlf, pp. 424-426.
Zar, pp. 195-200.
References
Bolek, M.G., and J.R. Coggins. 2003. Helminth community structure of sympatric eastern American toad, Bufo americanus americanus, northern leopard frog, Rana pipiens, and blue-spotted salamander, Ambystoma laterale, from southeastern Wisconsin. J. Parasit. 89: 673-680.
Buroker, N.E. 1983. Population genetics of the American oyster Crassostrea virginica along the Atlantic coast and the Gulf of Mexico. Mar. Biol. 75: 99-112.
McDonald, J.H., B.C. Verrelli and L.B. Geyer. 1996. Lack of geographic variation in anonymous nuclear polymorphisms in the American oyster, Crassostrea virginica. Molecular Biology and Evolution 13: 1114-1118.
This page was last revised January 5, 2008. Its address is http://udel.edu/~mcdonald/statkruskalwallis.html.
©2007-2008 by John H. McDonald. You can probably do what you want with this content; see the permissions page at http://udel.edu/~mcdonald/statpermissions.html for details.


