# Basic concepts of hypothesis testing

### Null hypothesis

 A giant concrete chicken in Lang Con Ca, Vietnam.

The null hypothesis is a statement that you want to test. In general, the null hypothesis is that things are the same as each other, or the same as a theoretical expectation. For example, if you measure the size of the feet of male and female chickens, the null hypothesis could be that the average foot size in male chickens is the same as the average foot size in female chickens. If you count the number of male and female chickens born to a set of hens, the null hypothesis could be that the ratio of males to females is equal to the theoretical expectation of a 1:1 ratio.

The alternative hypothesis is that things are different from each other, or different from a theoretical expectation. For example, one alternative hypothesis would be that male chickens have a different average foot size than female chickens; another would be that the sex ratio is different from 1:1.

Usually, the null hypothesis is boring and the alternative hypothesis is interesting. Finding that male chickens have bigger feet than female chickens might lead to all kinds of exciting discoveries about developmental biology, endocrine physiology, or sexual selection in chickens. Finding that male and female chickens have the same size feet wouldn't lead to anything except a boring paper in the world's most obscure chicken journal. It's therefore tempting to look for patterns in your data that support the exciting alternative hypothesis. For example, you might measure the feet of 10 male chickens and 10 female chickens and find that the mean is 0.1 mm longer for males. You're almost certain to get some difference in the means, just due to chance, so before you get all happy and start buying formal wear for the Nobel Prize ceremony, you need to ask "What's the probability of getting a difference in the means of 0.1 mm, just by chance, if the boring null hypothesis is really true?" Only when that probability is low can you reject the null hypothesis. The goal of statistical hypothesis testing is to estimate the probability of getting your observed results under the null hypothesis.

### Biological vs. statistical null hypotheses

It is important to distinguish between biological null and alternative hypotheses and statistical null and alternative hypotheses. "Sexual selection by females has caused male chickens to evolve bigger feet than females" is a biological alternative hypothesis; it says something about biological processes, in this case sexual selection. "Male chickens have a different average foot size than females" is a statistical alternative hypothesis; it says something about the numbers, but nothing about what caused those numbers to be different. The biological null and alternative hypotheses are the first that you should think of, as they describe something interesting about biology; they are two possible answers to the biological question you are interested in ("What affects foot size in chickens?"). The statistical null and alternative hypotheses are statements about the data that should follow from the biological hypotheses: if sexual selection favors bigger feet in male chickens (a biological hypothesis), then the average foot size in male chickens should be larger than the average in females (a statistical hypothesis). If you reject the statistical null hypothesis, you then have to decide whether that's enough evidence that you can reject your biological null hypothesis. For example, if you don't find a significant difference in foot size between male and female chickens, you could conclude "There is no significant evidence that sexual selection has caused male chickens to have bigger feet." If you do find a statistically significant difference in foot size, that might not be enough for you to conclude that sexual selection caused the bigger feet; it might be that males eat more, or that the bigger feet are a developmental byproduct of the roosters' combs, or that males run around more and the exercise makes their feet bigger. 
When there are multiple biological interpretations of a statistical result, you need to think of additional experiments to test the different possibilities.

### Testing the null hypothesis

The primary goal of a statistical test is to determine whether an observed data set is so different from what you would expect under the null hypothesis that you should reject the null hypothesis. For example, let's say you've given up on chicken feet and now are studying sex determination in chickens. For breeds of chickens that are bred to lay lots of eggs, female chicks are more valuable than male chicks, so if you could figure out a way to manipulate the sex ratio, you could make a lot of chicken farmers very happy. You've tested a treatment, and you get 25 female chicks and 23 male chicks. Anyone would look at those numbers and see that they could easily result from chance; there would be no reason to reject the null hypothesis of a 1:1 ratio of females to males. If you tried a different treatment and got 47 females and 1 male, most people would look at those numbers and see that they would be extremely unlikely to happen due to luck, if the null hypothesis were true; you would reject the null hypothesis and conclude that your treatment really changed the sex ratio. However, what if you had 31 females and 17 males? That's definitely more females than males, but is it really so unlikely to occur due to chance that you can reject the null hypothesis? To answer that, you need more than common sense; you need to calculate the probability of getting a deviation that large due to chance.

### P-values

 Probability of getting different numbers of males out of 48, if the parametric proportion of males is 0.5.

In the figure above, the BINOMDIST function of Excel was used to calculate the probability of getting each possible number of males, from 0 to 48, under the null hypothesis that 0.5 are male. As you can see, the probability of getting 17 males out of 48 total chickens is about 0.015. That seems like a pretty small probability, doesn't it? However, that's the probability of getting exactly 17 males. What you want to know is the probability of getting 17 or fewer males. If you were going to accept 17 males as evidence that the sex ratio was biased, you would also have accepted 16, or 15, or 14,… males as evidence for a biased sex ratio. You therefore need to add together the probabilities of all these outcomes. The probability of getting 17 or fewer males out of 48, under the null hypothesis, is 0.030. That means that if you had an infinite number of chickens, half males and half females, and you took a bunch of random samples of 48 chickens, 3.0% of the samples would have 17 or fewer males.
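The same two numbers can be reproduced without a spreadsheet. The following is a minimal sketch using only the Python standard library; the helper name `binom_pmf` and the variable names are chosen here for illustration and are not part of the original text.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_chicks = 48   # total chicks in the sample
p_male = 0.5    # proportion of males under the null hypothesis

# Probability of exactly 17 males.
p_exactly_17 = binom_pmf(17, n_chicks, p_male)

# Probability of 17 or fewer males: add up the outcomes 0, 1, ..., 17.
p_17_or_fewer = sum(binom_pmf(k, n_chicks, p_male) for k in range(0, 18))

print(f"P(exactly 17 males)  = {p_exactly_17:.3f}")   # 0.015
print(f"P(17 or fewer males) = {p_17_or_fewer:.3f}")  # 0.030
```

Summing over `range(0, 18)` is what turns the probability of one exact outcome into the probability of that outcome or anything more extreme in the same direction.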

This number, 0.030, is the P-value. It is defined as the probability of getting the observed result, or a more extreme result, if the null hypothesis is true. So "P=0.030" is a shorthand way of saying "The probability of getting 17 or fewer male chickens out of 48 total chickens, IF the null hypothesis is true that 50 percent of chickens are male, is 0.030."
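The "IF the null hypothesis is true" part of this definition can be made concrete with a small simulation: draw many random samples of 48 chicks from a population that really is half male, and count how often a sample has 17 or fewer males. This is only an illustrative sketch; the number of samples and the random seed are arbitrary choices, and the simulated fraction will be close to, not exactly equal to, 0.030.

```python
import random

random.seed(1)  # fixed seed so that repeated runs give the same answer

n_chicks = 48
n_samples = 100_000   # arbitrary; more samples give a steadier estimate
observed_males = 17

extreme = 0
for _ in range(n_samples):
    # Each of the 48 random bits stands for one chick:
    # 1 = male, 0 = female, each with probability 0.5.
    males = bin(random.getrandbits(n_chicks)).count("1")
    if males <= observed_males:
        extreme += 1

fraction = extreme / n_samples
print(f"Fraction of samples with {observed_males} or fewer males: {fraction:.3f}")
```

The exact binomial calculation is preferable when it is available; the simulation is here only to show what the P-value is a probability of.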

### Significance levels

Does a probability of 0.030 mean that you should reject the null hypothesis, and conclude that your treatment really caused a change in the sex ratio? The convention in most biological research is to use a significance level of 0.05. This means that if the probability value (P) is less than 0.05, you reject the null hypothesis; if P is greater than or equal to 0.05, you don't reject the null hypothesis. There is nothing mathematically magic about 0.05; people could have agreed upon 0.04, or 0.025, or 0.071 as the conventional significance level.

The significance level you use depends on the costs of different kinds of errors. With a significance level of 0.05, you have a 5 percent chance of rejecting the null hypothesis when it is actually true. If you try 100 treatments on your chickens, and none of them really work, about 5 percent of your experiments will, on average, give you data that are significantly different from a 1:1 sex ratio, just by chance. This is called a "Type I error," or "false positive." If there really is a deviation from the null hypothesis, and you fail to reject it, that is called a "Type II error," or "false negative." If you use a higher significance level than the conventional 0.05, such as 0.10, you will increase your chance of a false positive to 0.10 (therefore increasing your chance of an embarrassingly wrong conclusion), but you will also decrease your chance of a false negative (increasing your chance of detecting a subtle effect). If you use a lower significance level than the conventional 0.05, such as 0.01, you decrease your chance of an embarrassing false positive, but you also make it less likely that you'll detect a real deviation from the null hypothesis if there is one.
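The trade-off between the two kinds of error can be put into numbers for the 48-chick experiment. The sketch below counts only a shortage of males as evidence, as in the "17 or fewer males" example, and it assumes a purely hypothetical treatment that lowers the true proportion of males to 0.35; that value, and the function names, are illustrative assumptions rather than anything from the text. Because the number of males can only be a whole number, the false positive rate you actually get sits at or a little below the nominal significance level.

```python
from math import comb


def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def critical_count(n, alpha):
    """Largest number of males whose 'that many or fewer' probability,
    under a 1:1 sex ratio, is still below alpha. Returns -1 if there is none."""
    k = -1
    while k + 1 <= n and binom_cdf(k + 1, n, 0.5) < alpha:
        k += 1
    return k


n_chicks = 48
true_p_male = 0.35  # hypothetical real effect, for illustration only

results = {}
for alpha in (0.01, 0.05, 0.10):
    k = critical_count(n_chicks, alpha)
    # Type I: chance of k or fewer males when the null hypothesis is true.
    false_positive = binom_cdf(k, n_chicks, 0.5)
    # Type II: chance of more than k males when the treatment really works.
    false_negative = 1 - binom_cdf(k, n_chicks, true_p_male)
    results[alpha] = (k, false_positive, false_negative)
    print(
        f"alpha={alpha:.2f}: reject at {k} or fewer males, "
        f"false positive rate={false_positive:.3f}, "
        f"false negative rate={false_negative:.3f}"
    )
```

Raising the significance level moves the cutoff toward 24 males, which raises the false positive rate and lowers the false negative rate; lowering it does the opposite.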

You must choose your significance level before you collect the data, of course. If you choose to use a different significance level than the conventional 0.05, be prepared for some skepticism; you must be able to justify your choice. If you were screening a bunch of potential sex-ratio-changing treatments, the cost of a false positive would be the cost of a few additional tests, which would show that your initial results were a false positive. The cost of a false negative, however, would be that you would miss out on a tremendously valuable discovery. You might therefore set your significance level to 0.10 or more. On the other hand, once your sex-ratio-changing treatment is undergoing final trials before being sold to farmers, you'd want to be very confident that it really worked, not that you were just getting a false positive. Otherwise, if you sell the chicken farmers a sex-ratio treatment that turns out to not really work (it was a false positive), they'll sue the pants off of you. Therefore, you might want to set your significance level to 0.01, or even lower. Throughout this handbook, I will always use P<0.05 as the significance level.

### One-tailed vs. two-tailed probabilities

The probability that was calculated above, 0.030, is the probability of getting 17 or fewer males out of 48. It would be significant, using the conventional P<0.05 criterion. However, what about the probability of getting 17 or fewer females? If your null hypothesis is "The proportion of males is 0.5 or more" and your alternative hypothesis is "The proportion of males is less than 0.5," then you would use the P=0.03 value found by adding the probabilities of getting 17 or fewer males. This is called a one-tailed probability, because you are adding the probabilities in only one tail of the distribution shown in the figure. However, if your null hypothesis is "The proportion of males is 0.5", then your alternative hypothesis is "The proportion of males is different from 0.5." In that case, you should add the probability of getting 17 or fewer females to the probability of getting 17 or fewer males. This is called a two-tailed probability. If you do that with the chicken result, you get P=0.06, which is not quite significant.
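The one-tailed and two-tailed values for the chicken result can be computed side by side. This is a minimal sketch with the Python standard library; the helper `binom_pmf` and the variable names are illustrative. Getting 17 or fewer females out of 48 is the same event as getting 31 or more males, which is how the upper tail is summed here.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_chicks = 48
p_male = 0.5
observed_males = 17

# Lower tail: 17 or fewer males.
lower_tail = sum(binom_pmf(k, n_chicks, p_male) for k in range(0, observed_males + 1))

# Upper tail: 17 or fewer females, which means 31 or more males.
upper_tail = sum(
    binom_pmf(k, n_chicks, p_male)
    for k in range(n_chicks - observed_males, n_chicks + 1)
)

one_tailed = lower_tail
two_tailed = lower_tail + upper_tail

print(f"one-tailed P = {one_tailed:.2f}")  # 0.03
print(f"two-tailed P = {two_tailed:.2f}")  # 0.06
```

With a null proportion of 0.5 the distribution is symmetrical, so the two tails are equal and the two-tailed value is simply twice the one-tailed value; with other null proportions the two tails would have to be defined and summed separately.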

You should decide whether to use the one-tailed or two-tailed probability before you collect your data, of course. A one-tailed probability is more powerful, in the sense of having a lower chance of false negatives, but you should only use a one-tailed probability if you really, truly have a firm prediction about which direction of deviation you would consider interesting. In the chicken example, you might be tempted to use a one-tailed probability, because you're only looking for treatments that decrease the proportion of worthless male chickens. But if you accidentally found a treatment that produced 87 percent male chickens, would you really publish the result as "The treatment did not cause a significant decrease in the proportion of male chickens"? Probably not. You'd realize that this unexpected result, even though it wasn't what you and your farmer friends wanted, would be very interesting to other people; by leading to discoveries about the fundamental biology of sex-determination in chickens, it might even help you produce more female chickens someday. Any time a deviation in either direction would be interesting, you should use the two-tailed probability. In addition, people are skeptical of one-tailed probabilities, especially if a one-tailed probability is significant and a two-tailed probability would not be significant (as in the chicken example). Unless you provide a very convincing explanation, people may think you decided to use the one-tailed probability after you saw that the two-tailed probability wasn't quite significant. It may be easier to always use two-tailed probabilities. For this handbook, I will always use two-tailed probabilities, unless I make it very clear that only one direction of deviation from the null hypothesis would be interesting.

In the olden days, when people looked up P-values in printed tables, they would report the results of a statistical test as "P<0.05", "P<0.01", "P>0.10", etc. Nowadays, almost all computer statistics programs give the exact P value resulting from a statistical test, such as P=0.029, and that's what you should report in your publications. You will conclude that the results are either significant or they're not significant; they either reject the null hypothesis (if P is below your pre-determined significance level) or don't reject the null hypothesis (if P is above your significance level). But other people will want to know if your results are "strongly" significant (P much less than 0.05), which will give them more confidence in your results than if they were "barely" significant (P=0.043, for example). In addition, other researchers will need the exact P value if they want to combine your results with others into a meta-analysis.

Sokal and Rohlf, pp. 157-169.

Zar, pp. 79-85.

### Reference

Picture of giant concrete chicken from Sue and Tony's Photo Site.