# Paired t-test

### When to use it

You use the paired t-test when there is one measurement variable and two nominal variables. One of the nominal variables has only two values. The most common design is that one nominal variable represents different individuals, while the other is "before" and "after" some treatment. Sometimes the pairs are spatial rather than temporal, such as left vs. right, injured limb vs. uninjured limb, above a dam vs. below a dam, etc.

An example would be the performance of undergraduates on a test of manual dexterity before and after drinking a cup of tea. For each student, there would be two observations, one before the tea and one after. I would expect the students to vary widely in their performance, so if the tea decreased their mean performance by 5 percent, it would take a very large sample size to detect this difference if the data were analyzed using an unpaired Student's t-test. A paired t-test has much more statistical power when the difference between groups is small relative to the variation within groups.

The paired t-test is only appropriate when there is just one observation for each combination of the nominal values. For the tea example, that would be one measurement of dexterity on each student before drinking tea, and one measurement after drinking tea. If you had multiple measurements of dexterity on each student before and after drinking tea, you would do a two-way anova with replication.

### Null hypothesis

The null hypothesis is that the mean difference between paired observations is zero. This is mathematically equivalent to the null hypothesis of a one-way anova or t-test, that the means of the groups are equal, but because of the paired design of the data, the null hypothesis of a paired t-test is usually expressed in terms of the mean difference.

### Assumption

The paired t-test assumes that the differences between pairs are normally distributed; you can check this by plotting a histogram of the differences. If the differences between pairs are severely non-normal, it would be better to use the Wilcoxon signed-rank test. I don't think the test is very sensitive to deviations from normality, so unless the deviation from normality is really obvious, I wouldn't worry about it.

### How the test works

The difference between the observations is calculated for each pair, and the mean and standard error of these differences are calculated. Dividing the mean by the standard error of the mean yields a test statistic, t_s, that is t-distributed with degrees of freedom equal to one less than the number of pairs.

### Examples

 Northern flicker, Colaptes auratus.

Wiebe and Bortolotti (2002) examined color in the tail feathers of northern flickers. Some of the birds had one "odd" feather that was different in color or length from the rest of the tail feathers, presumably because it was regrown after being lost. They measured the yellowness of one odd feather on each of 16 birds and compared it with the yellowness of one typical feather from the same bird. There are two nominal variables, type of feather (typical or odd) and the individual bird, and one measurement variable, yellowness. Because these birds were from a hybrid zone between red-shafted flickers and yellow-shafted flickers, there was a lot of variation among birds in color, making a paired analysis more appropriate. The difference was significant (P=0.001), with the odd feathers significantly less yellow than the typical feathers (higher numbers are more yellow).

```
Yellowness index

        Typical     Odd
Bird    feather     feather
A       -0.255      -0.324
B       -0.213      -0.185
C       -0.190      -0.299
D       -0.185      -0.144
E       -0.045      -0.027
F       -0.025      -0.039
G       -0.015      -0.264
H        0.003      -0.077
I        0.015      -0.017
J        0.020      -0.169
K        0.023      -0.096
L        0.040      -0.330
M        0.040      -0.346
N        0.050      -0.191
O        0.055      -0.128
P        0.058      -0.182
```
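
The flicker result can be reproduced by hand from the steps described under "How the test works". The following is a minimal sketch in Python using only the standard library; it is not the authors' analysis. The helper names `t_pdf` and `two_tailed_p` are illustrative, and the P-value is obtained by numerically integrating the t density with Simpson's rule, an assumption made here to avoid a statistics package (in practice a library routine such as `scipy.stats.ttest_rel` would normally be used).

```python
import math
import statistics

# Yellowness index for birds A-P, copied from the table above.
typical = [-0.255, -0.213, -0.190, -0.185, -0.045, -0.025, -0.015, 0.003,
           0.015, 0.020, 0.023, 0.040, 0.040, 0.050, 0.055, 0.058]
odd = [-0.324, -0.185, -0.299, -0.144, -0.027, -0.039, -0.264, -0.077,
       -0.017, -0.169, -0.096, -0.330, -0.346, -0.191, -0.128, -0.182]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed P-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


# Step 1: one difference per pair (typical minus odd).
diffs = [a - b for a, b in zip(typical, odd)]
n = len(diffs)

# Step 2: mean and standard error of the differences.
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)      # sample standard deviation (n - 1)
se_diff = sd_diff / math.sqrt(n)

# Step 3: test statistic and degrees of freedom.
t_s = mean_diff / se_diff
df = n - 1
p = two_tailed_p(t_s, df)

print(f"mean difference = {mean_diff:.3f}, SD of differences = {sd_diff:.3f}")
print(f"t_s = {t_s:.2f}, d.f. = {df}, P = {p:.4f}")
```

The mean difference, standard deviation of differences, test statistic, and P-value computed this way should agree with the values quoted elsewhere on this page (0.137, 0.135, 4.06, and about 0.001).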

Wilder and Rypstra (2004) tested the effect of praying mantis excrement on the behavior of wolf spiders. They put 12 wolf spiders in individual containers; each container had two semicircles of filter paper, one semicircle that had been smeared with praying mantis excrement and one without excrement. They observed each spider for one hour, and measured its walking speed while it was on each half of the container. There are two nominal variables, filter paper type (with or without excrement) and the individual spider, and one measurement variable (walking speed). Different spiders may have different overall walking speed, so a paired analysis is appropriate to test whether the presence of praying mantis excrement changes the walking speed of a spider. The mean change in walking speed is almost, but not quite, significantly different from 0 (t=2.11, 11 d.f., P=0.053).

### Graphing the results

If there are a moderate number of pairs, you could either plot each individual value on a bar graph, or plot the differences. Here is one graph in each format for the flicker data:

 Colors of tail feathers in the northern flicker. The graph on the top shows the yellowness index for a "typical" feather with a black bar and an "odd" feather with a gray bar. The graph on the bottom shows the difference (typical − odd).

### Related tests

The paired t-test is mathematically equivalent to one of the hypothesis tests of a two-way anova without replication. The paired t-test is simpler to perform and may be more familiar. A two-way anova would be better if both null hypotheses (equality of means of the two treatments and equality of means of the individuals) were of interest; in a paired t-test, the means of individuals are so likely to be different that there's no point in testing them. A two-way anova would have to be used if the measurements are replicated for the treatment/individual combinations.
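
The equivalence can be checked numerically. The sketch below, in standard-library Python, partitions the flicker data into feather-type, bird, and error sums of squares as in a two-way anova without replication, then compares the F statistic for feather type with the square of the paired t statistic. Variable names are illustrative, and this is a demonstration on one data set rather than a proof.

```python
import math
import statistics

# Yellowness index for birds A-P (flicker data from the Examples section).
typical = [-0.255, -0.213, -0.190, -0.185, -0.045, -0.025, -0.015, 0.003,
           0.015, 0.020, 0.023, 0.040, 0.040, 0.050, 0.055, 0.058]
odd = [-0.324, -0.185, -0.299, -0.144, -0.027, -0.039, -0.264, -0.077,
       -0.017, -0.169, -0.096, -0.330, -0.346, -0.191, -0.128, -0.182]

n = len(typical)   # number of birds (individuals)
k = 2              # number of feather types (treatments)
grand_mean = (sum(typical) + sum(odd)) / (n * k)

# Sum of squares among feather types.
ss_feather = n * sum((statistics.mean(g) - grand_mean) ** 2 for g in (typical, odd))

# Sum of squares among birds.
bird_means = [(a + b) / 2 for a, b in zip(typical, odd)]
ss_bird = k * sum((m - grand_mean) ** 2 for m in bird_means)

# Error sum of squares is what remains of the total.
ss_total = sum((x - grand_mean) ** 2 for x in typical + odd)
ss_error = ss_total - ss_feather - ss_bird

df_feather = k - 1
df_error = (k - 1) * (n - 1)
f_feather = (ss_feather / df_feather) / (ss_error / df_error)

# Paired t statistic on the same data.
diffs = [a - b for a, b in zip(typical, odd)]
t_s = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

print(f"F for feather type = {f_feather:.2f} ({df_feather} and {df_error} d.f.)")
print(f"t_s squared        = {t_s ** 2:.2f} ({n - 1} d.f.)")
```

The two printed statistics should match, and the error degrees of freedom of the anova equal the degrees of freedom of the paired t-test.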

If the paired nature of the data is ignored, the data would be analyzed using a one-way anova or a regular t-test. The loss in statistical power can be quite dramatic, however, so this is not a good idea.

One non-parametric analogue of the paired t-test is the Wilcoxon signed-rank test. A simpler and even less powerful test is the sign test, which considers only the direction of difference between pairs of observations, not the size of the difference.
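
To illustrate how the sign test uses only the direction of each difference, here is a minimal sketch in standard-library Python applied to the flicker data. It counts positive and negative differences and computes an exact two-tailed binomial P-value; dropping ties and doubling the smaller tail are conventions assumed here, not details given in the text.

```python
import math

# Yellowness index for birds A-P (flicker data from the Examples section).
typical = [-0.255, -0.213, -0.190, -0.185, -0.045, -0.025, -0.015, 0.003,
           0.015, 0.020, 0.023, 0.040, 0.040, 0.050, 0.055, 0.058]
odd = [-0.324, -0.185, -0.299, -0.144, -0.027, -0.039, -0.264, -0.077,
       -0.017, -0.169, -0.096, -0.330, -0.346, -0.191, -0.128, -0.182]

diffs = [a - b for a, b in zip(typical, odd)]

# Only the sign of each difference is used; zero differences would be dropped.
n_pos = sum(d > 0 for d in diffs)
n_neg = sum(d < 0 for d in diffs)
n = n_pos + n_neg

# Under the null hypothesis each sign is equally likely, so the count of the
# rarer sign follows a binomial distribution with probability 0.5.
k = min(n_pos, n_neg)
p_one_tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
p_sign = min(1.0, 2 * p_one_tail)

print(f"{n_pos} positive and {n_neg} negative differences")
print(f"sign test two-tailed P = {p_sign:.4f}")
```

For these data the sign test P-value is considerably larger than the P=0.001 reported for the paired t-test, which is consistent with the sign test being less powerful.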

### How to do the test

Spreadsheets have a built-in function to perform paired t-tests. Put the "before" numbers in one column, and the "after" numbers in the adjacent column, with the before and after observations from each individual on the same row. Then enter =TTEST(array1, array2, tails, type), where array1 is the first column of data, array2 is the second column of data, tails is normally set to 2 for a two-tailed test, and type is set to 1 for a paired t-test. The result of this function is the P-value of the paired t-test.

#### SAS

To do a paired t-test in SAS, you use PROC TTEST with the PAIRED option. Here is an example using the feather data from above:

```
data feathers;
input bird $ typical odd;
cards;
A    -0.255	   -0.324
B    -0.213	   -0.185
C    -0.190	   -0.299
D    -0.185	   -0.144
E    -0.045	   -0.027
F    -0.025	   -0.039
G    -0.015	   -0.264
H     0.003	   -0.077
I     0.015	   -0.017
J     0.020	   -0.169
K     0.023	   -0.096
L     0.040	   -0.330
M     0.040	   -0.346
N     0.050	   -0.191
O     0.055	   -0.128
P     0.058	   -0.182
;
proc ttest data=feathers;
paired typical*odd;
run;

```

The results include the following, which shows that the P-value is 0.0010:

```
T-Tests
Difference             DF    t Value    Pr > |t|
typical - odd          15       4.06      0.0010

```

### Power analysis

To estimate the sample sizes needed to detect a mean difference that is significantly different from zero, you need the following:

• the effect size, or the mean difference. In the feather data used above, the mean difference between typical and odd feathers is 0.137 yellowness units;
• the standard deviation of differences. Note that this is not the standard deviation within each group. For example, in the feather data, the standard deviation of the differences is 0.135; this is neither the standard deviation among typical feathers nor the standard deviation among odd feathers, but the standard deviation of the differences;
• alpha, or the significance level (usually 0.05);
• power, the probability of rejecting the null hypothesis when it is false (0.80 and 0.90 are common values). Power is 1 − beta, where beta is the probability of accepting the null hypothesis when it is false.

As an example, let's say you want to do a study comparing the redness of typical and odd tail feathers in cardinals. The closest you can find to preliminary data is the Wiebe and Bortolotti (2002) paper on yellowness in flickers. They found a mean difference of 0.137 yellowness units, with a standard deviation of differences of 0.135; you arbitrarily decide you want to be able to detect a mean difference of 0.10 redness units in your cardinals. In the power-analysis form on the web page, you enter 0.10 for "Mean difference", 0.135 for "Standard deviation of differences", 0.05 for the alpha, and 0.80 for the power. The result is 17, so you'll need a minimum of 17 birds (with two observations per bird).
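
The cardinal example can be roughly checked with a short calculation. The sketch below uses a commonly used normal approximation with a small-sample correction term, n ≈ ((z_alpha + z_power) / d)² + z_alpha² / 2, where d is the mean difference divided by the standard deviation of differences. This is an assumption made for illustration: it is not necessarily the method used by the web form or by G*Power, which may rely on the noncentral t distribution and can differ by a pair or so. The function name `pairs_needed` is hypothetical.

```python
import math
from statistics import NormalDist


def pairs_needed(mean_diff, sd_diff, alpha=0.05, power=0.80):
    """Approximate number of pairs for a two-tailed paired t-test.

    Normal approximation plus a small-sample correction (z_alpha^2 / 2);
    an illustrative approximation, not an exact noncentral-t calculation.
    """
    d = mean_diff / sd_diff                         # standardized effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_power) / d) ** 2 + z_alpha ** 2 / 2
    return math.ceil(n)


# Cardinal example: detect a mean difference of 0.10 with SD of differences 0.135.
n_pairs = pairs_needed(0.10, 0.135, alpha=0.05, power=0.80)
print(f"approximate number of pairs needed: {n_pairs}")
```

With the inputs from the cardinal example, this approximation gives 17 pairs, in agreement with the result quoted above.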

You can also do a power analysis for a paired t-test using the free program G*Power. Choose "t tests" from the "Test family" menu and "Means: difference between dependent means (matched pairs)" from the "Statistical test" menu. To determine the effect size, click on the Determine button and enter the mean difference and the standard deviation of the difference. Then click on the "Calculate and transfer to main window" button; it calculates the effect size and enters it into the main window. Enter your alpha (usually 0.05) and power (typically 0.80 or 0.90) and hit the Calculate button. The result is the number of pairs of observations.

### Further reading

Sokal and Rohlf, pp. 698-703, 729-730.

Zar, pp. 161-164.

### References

Picture of northern flicker from Steve Nanz.

Wiebe, K.L., and G.R. Bortolotti. 2002. Variation in carotenoid-based color in northern flickers in a hybrid zone. Wilson Bull. 114: 393-400.

Wilder, S.M., and A.L. Rypstra. 2004. Chemical cues from an introduced predator (Mantodea, Mantidae) reduce the movement and foraging of a native wolf spider (Araneae, Lycosidae) in the laboratory. Environ. Entom. 33: 1032-1036.