I find the descriptions of correlation and regression in most textbooks to be unnecessarily confusing. Some statistics textbooks have correlation and linear regression in separate chapters, and make it seem as if it is important to pick one technique or the other, based on subtle differences in the design and assumptions of the experiment. I think this overemphasizes the differences between them. Other books muddle correlation and regression together, leaving the reader puzzled about what the difference is.
My understanding of the two techniques, as they are practiced, is that they primarily differ in goals. The goal of a correlation analysis is to see whether two measurement variables covary, and to measure the strength of any relationship between the variables. The results of correlation are expressed as a P-value (for the hypothesis test) and an r-value (correlation coefficient) or r2 value (coefficient of determination). The goal of linear regression is to find the equation (slope and intercept) of the line that best fits the points; this line is then used as a visual summary of the relationship between the variables, or for estimating unknown values of one variable when given the value of the other.
When you have two measurement variables in biology, you'll usually want to do both correlation and regression—you'll want the P-value of the hypothesis test, and the r2 that describes the strength of the relationship, and the regression line that illustrates the relationship. It would be less confusing if there were a single name for the whole process, just like "anova" includes testing hypotheses, partitioning variance, and estimating means. Since there isn't a single name, one option is to refer to the P-value and r2 as resulting from a correlation analysis, while the equation of the line results from linear regression: "The correlation of variables X and Y is significant (r2=0.89, P=0.007); the linear regression line is shown in the figure." It is also common to say something like "The linear regression of Y on X is significant (r2=0.89, P=0.007)"; either seems appropriate. The one thing you should not do is call a linear regression line a "correlation line"; if that means anything, it means something different from a regression line.
Here I'll treat correlation and linear regression as different aspects of a single analysis. Be aware that this approach will probably be different from what you'll see elsewhere.
When to use them
Correlation and linear regression are used when you have two measurement variables, such as food intake and weight, drug dosage and blood pressure, air temperature and metabolic rate, etc.
There's also one nominal variable that keeps the two measurements together in pairs, such as the name of an individual organism. I'm not aware that anyone else considers this nominal variable to be part of correlation and regression, and it's not something you need to know the value of—you could indicate that a food intake measurement and weight measurement came from the same rat by putting both numbers on the same line, without ever giving the rat a name. For that reason, I'll call it a "hidden" nominal variable.
The data are typically plotted as a scatter of points on a graph, with one variable on the X axis and the other variable on the Y axis. The goals are to find the equation for the line that best fits these points, and to determine whether the slope of this line is significantly different from zero. If the slope is significantly different from zero, there is a significant relationship between the two variables: as the values of one variable increase, the values of the other variable either tend to increase (if the slope is positive) or tend to decrease (if the slope is negative).
There are three main uses for correlation and regression in biology. One is to test hypotheses about cause-and-effect relationships. In this case, the experimenter determines the values of the X-variable and sees whether variation in X causes variation in Y. An example would be giving people different amounts of a drug and measuring their blood pressure. The null hypothesis would be that there was no relationship between the amount of drug and the blood pressure. If the null hypothesis is rejected, the conclusion would be that the amount of drug causes changes in the blood pressure.
The second main use for correlation and regression is to see whether two variables are associated, without necessarily inferring a cause-and-effect relationship. In this case, neither variable is determined by the experimenter; both are naturally variable. If an association is found, the inference is that variation in X may cause variation in Y, or variation in Y may cause variation in X, or variation in some other factor may affect both X and Y. An example would be measurements of the amount of a particular protein on the surface of some cells and the pH of the cytoplasm of those cells. If the protein amount and pH are correlated, it may be that the amount of protein affects the internal pH; or the internal pH affects the amount of protein; or some other factor, such as oxygen concentration, affects both protein concentration and pH. Often, a significant correlation suggests further experiments to test for a cause and effect relationship; if protein concentration and pH were correlated, you might want to manipulate protein concentration and see what happens to pH, or manipulate pH and measure protein, or manipulate oxygen and see what happens to both.
The third common use of linear regression is estimating the value of one variable corresponding to a particular value of the other variable. For example, if you were doing a protein assay you would start by constructing a standard curve. You would add the reagent to known amounts of protein (10, 20, 30 mg, etc.) and measure the absorbance. You would then find the equation for the regression line, with protein amount as the X variable and absorbance as the Y variable. Then when you measure the absorbance of a sample with an unknown amount of protein, you can rearrange the equation of the regression line to solve for X and estimate the amount of protein in the sample.
The null hypothesis is that the slope of the best-fit line is equal to zero; in other words, as the X variable gets larger, the associated Y variable gets neither higher nor lower.
It is also possible to test the null hypothesis that the Y value predicted by the regression equation for a given value of X is equal to some theoretical expectation; the most common would be testing the null hypothesis that the Y intercept is 0. This is rarely necessary in biological experiments, so I won't cover it here, but be aware that it is possible.
If a cause-and-effect relationship is being tested, the variable that causes the relationship is called the independent variable and is plotted on the X axis, while the effect is called the dependent variable and is plotted on the Y axis. In some cases the experimenter determines the value of the independent variable, such as putting frogs in temperature-controlled chambers and measuring their calling rate. In other cases, both variables exhibit natural variation, but any cause-and-effect relationship could run in only one direction. If you measure the air temperature and frog calling rate at a pond on several different nights, both the air temperature and the calling rate would display natural variation, but if there's a cause-and-effect relationship, it's temperature affecting calling rate; the rate at which frogs call does not affect the air temperature.
Sometimes it's not clear which is the independent variable and which is the dependent. For example, if you measure the salt content of people's food and their blood pressure to test whether eating more salt causes higher blood pressure, you'd want to make salt content the independent variable. But if you thought that high blood pressure caused people to crave high-salt foods, you'd make blood pressure the independent variable.
Sometimes, you're not looking for a cause-and-effect relationship at all; if you measure the range-of-motion of the hip and the shoulder, you're not trying to see whether more flexible hips cause more flexible shoulders, or vice versa, you're just trying to see if people with more flexible hips also tend to have more flexible shoulders, presumably due to some factor (age, diet, exercise, genetics) that affects overall flexibility. In this case, it would be completely arbitrary which variable you put on the X axis and which you put on the Y axis.
Fortunately, the P-value and the r2 are not affected by which variable you call the X and which you call the Y; you'll get mathematically identical values either way. The regression line does depend on which variable is the X and which is the Y; the two lines can be quite different if the r2 is low. If you're truly interested only in whether the two variables covary, and you are not trying to infer a cause-and-effect relationship, you may want to avoid using the linear regression line as decoration on your graph.
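Here is a minimal Python sketch of that point. It uses the six X,Y pairs from the r2 illustration later on this page; the function name `fit` and the variable names are my own choices for illustration. It fits Y on X and then X on Y, and shows that r2 is the same either way while the two lines differ. Because the test statistic depends only on r2 and the degrees of freedom, the P-value is the same either way as well.

```python
# Six illustrative (X, Y) pairs, the same ones used in this page's r2 example.
xs = [1, 3, 5, 6, 7, 9]
ys = [2, 9, 9, 11, 14, 15]

def fit(x, y):
    """Least-squares line for y regressed on x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2

slope_yx, intercept_yx, r2_yx = fit(xs, ys)  # Y regressed on X
slope_xy, intercept_xy, r2_xy = fit(ys, xs)  # X regressed on Y

# Re-express the X-on-Y line in the usual Y-versus-X orientation so the
# two lines can be compared on the same graph.
slope_xy_as_y = 1 / slope_xy
intercept_xy_as_y = -intercept_xy / slope_xy

print(f"r2 (Y on X) = {r2_yx:.4f}; r2 (X on Y) = {r2_xy:.4f}")
print(f"Y on X:                    Y = {intercept_yx:.4f} + {slope_yx:.4f} X")
print(f"X on Y, redrawn as Y vs X: Y = {intercept_xy_as_y:.4f} + {slope_xy_as_y:.4f} X")
```

With an r2 this high the two lines are close; the lower the r2, the further apart they drift.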
In some fields, the independent variable is traditionally plotted on the Y axis. In oceanography, for example, depth is often plotted on the Y axis (with 0 at the top) and a variable that is directly or indirectly affected by depth, such as chlorophyll concentration, is plotted on the X axis. I wouldn't recommend this unless it's a really strong tradition in your field, as it could lead to confusion about which variable is the independent variable in a linear regression.
No "error" in X variable. One assumption of linear regression is that the X variable is set by the experimenter and there is no error, either measurement error or biological variation. If you're only using the regression line to illustrate the relationship between two variables (basically, it's decoration on your graph), violation of this assumption doesn't matter much. If you're trying to accurately predict Y from X or predict X from Y and the X variable has a lot of measurement error or biological variation, you may want to look into different techniques for "model II regression," such as "major axis regression" or "reduced major axis regression," which are not covered here.
Normality and homoscedasticity. Two more assumptions, similar to those for anova, are that for any value of X, the Y values will be normally distributed and they will be homoscedastic. Although you will rarely have enough data to test these assumptions, they are often violated, especially homoscedasticity. If there is a significant regression, X values with higher mean Y values will often have higher variances of Y as well. A data transformation of the Y variable may fix this problem, but if that doesn't work, you can use the non-parametric Spearman rank correlation instead. I don't know how much non-normality or heteroscedasticity is enough to make linear regression and correlation inappropriate.
Linearity. Linear regression assumes that the data fit a straight line. If this isn't the case, a data transformation may help, or it may be necessary to use polynomial regression.
Independence. Linear regression assumes that the data points are independent of each other, meaning that the value of one data point does not depend on what the value of any other data point is. The most common violation of this assumption is in time series data, where some Y variable has been measured at different times. For example, let's say you've counted the number of elephants in a park in Africa every year. The population either goes up by 10% or goes down by 10% each year, and the direction of change is completely random. The number of elephants in one year is not independent of the number of elephants in the previous year; it is highly dependent on it. If the number of elephants in one year is high, the number in the next year will still be pretty high, even if it goes down by 10%. The direction of change from one year to the next is completely random, so you wouldn't expect a significant regression, but this kind of non-independence can give you a "significant" regression much more often than 5% of the time, even when the null hypothesis of no relationship between X and Y is true.
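The elephant example can be checked with a small simulation. The following Python sketch is only an illustration under assumptions of my own: a 20-year series, a starting population of 1000, 2000 simulated parks, and a comparison series of independent uniform random counts. It regresses population on year and counts how often the result exceeds the two-tailed 5% critical value of t for 18 d.f. (2.101).

```python
import random

YEARS = 20        # assumed length of each time series
T_CRIT = 2.101    # two-tailed 5% critical value of t for YEARS - 2 = 18 d.f.
N_SIM = 2000      # assumed number of simulated parks
START = 1000.0    # assumed starting population

def t_statistic(x, y):
    """Absolute t statistic for the regression of y on x, computed from r2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)
    return ((n - 2) * r2 / (1 - r2)) ** 0.5

def random_walk_counts(rng):
    """Population goes up or down by 10% each year, direction chosen at random."""
    counts = [START]
    for _ in range(YEARS - 1):
        change = 1.10 if rng.random() < 0.5 else 0.90
        counts.append(counts[-1] * change)
    return counts

def independent_counts(rng):
    """Comparison case: each year's count is drawn independently."""
    return [rng.uniform(500, 1500) for _ in range(YEARS)]

def false_positive_rate(make_series, rng):
    """Fraction of simulated series whose regression on year is 'significant'."""
    years = list(range(YEARS))
    hits = sum(t_statistic(years, make_series(rng)) > T_CRIT for _ in range(N_SIM))
    return hits / N_SIM

rng = random.Random(1)  # fixed seed so the run is repeatable
rate_walk = false_positive_rate(random_walk_counts, rng)
rate_indep = false_positive_rate(independent_counts, rng)
print(f"'significant' regressions, random-walk counts: {rate_walk:.1%}")
print(f"'significant' regressions, independent counts: {rate_indep:.1%}")
```

The independent series should come out "significant" about 5% of the time, while the random-walk series should do so far more often, even though year has no real effect on population in either case.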
There are special statistical tests for time-series data and other non-independent data (such as data showing spatial autocorrelation). I will not cover them here.
How the test works
|The graph shows the data points (dots), linear regression line (thick line), and data points connected to the point on the regression line with the same X value (thin lines). The regression line is the line that minimizes the sum of the squared vertical distances between the points and the line.|
Linear regression finds the line that best fits the data points. In this case, the "best" fit is defined as the line that minimizes the sum of the squared vertical distances between the data points and the line. For a data point with an X value of X1 and a Y value of Y1, the difference between Y1 and the value of Y on the line at X1 is calculated, then squared. This squared deviate is calculated for each data point, and the sum of these squared deviates measures how well a line fits the data. The regression line is the one for which this sum of squared deviates is smallest.
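As a rough illustration of "smallest sum of squared deviates," the Python sketch below uses the six X,Y pairs that appear later on this page in the r2 illustration. It computes the least-squares slope and intercept with the standard closed-form formulas, then nudges the slope and intercept and confirms that every nudged line has a larger sum of squared deviates. The function and variable names, and the sizes of the nudges, are arbitrary choices of mine.

```python
# Six illustrative (X, Y) pairs, the same ones used in this page's r2 example.
xs = [1, 3, 5, 6, 7, 9]
ys = [2, 9, 9, 11, 14, 15]

def sum_squared_deviates(slope, intercept):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least-squares estimates of the slope and intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
best_slope = sxy / sxx
best_intercept = mean_y - best_slope * mean_x
best_ss = sum_squared_deviates(best_slope, best_intercept)

# Nudge the slope and intercept; any other line should fit worse.
nudged = [
    sum_squared_deviates(best_slope + d_slope, best_intercept + d_int)
    for d_slope in (-0.2, 0.0, 0.2)
    for d_int in (-1.0, 0.0, 1.0)
    if (d_slope, d_int) != (0.0, 0.0)
]

print(f"best line: Y = {best_intercept:.4f} + {best_slope:.4f} X")
print(f"sum of squared deviates for best line: {best_ss:.2f}")
print(f"smallest sum among nudged lines:       {min(nudged):.2f}")
```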
The equation for the regression line is usually expressed as Y=intercept+slope×X. This equation can be used to predict the value of Y for a given value of X. You can also predict X from Y, using the equation X=(Y−intercept)/slope. These predictions are best done within the range of X and Y values observed in the data (interpolation). Predicting Y or X values outside the range of observed values (extrapolation) is sometimes interesting, but it can easily yield ridiculous results. For example, in the frog example below, you could mathematically predict that the inter-call interval would be about 16 seconds at -40°C. Actually, the frogs would not call at that temperature; they'd be dead.
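To make the two prediction formulas concrete, here is a small Python sketch that plugs in the frog regression line reported further down this page (interval=−0.205(temperature)+8.36) and the 10°C to 34°C range of temperatures that Gayou (1984) sampled. The function names are hypothetical, and the range check is just one simple way to flag extrapolation.

```python
# Regression line for the gray tree frog example on this page:
# interval = 8.36 + (-0.205) * temperature
INTERCEPT = 8.36
SLOPE = -0.205
OBSERVED_TEMPS = (10.0, 34.0)  # range of temperatures sampled, in degrees C

def predict_interval(temperature_c):
    """Predict Y (inter-call interval) from X (temperature)."""
    return INTERCEPT + SLOPE * temperature_c

def predict_temperature(interval):
    """Predict X from Y by rearranging the line: X = (Y - intercept) / slope."""
    return (interval - INTERCEPT) / SLOPE

def is_extrapolation(temperature_c):
    """True if the temperature lies outside the observed range."""
    low, high = OBSERVED_TEMPS
    return not (low <= temperature_c <= high)

for temp in (20.0, -40.0):
    label = "extrapolation" if is_extrapolation(temp) else "interpolation"
    print(f"{temp:6.1f} C -> predicted interval {predict_interval(temp):.2f} ({label})")

# Round trip: predicting X from the predicted Y recovers the original X.
print(f"{predict_temperature(predict_interval(20.0)):.1f}")
```

The arithmetic works just as happily at −40°C as at 20°C, which is exactly why extrapolated values need a sanity check against the biology.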
Coefficient of determination (r2)
|Three relationships with the same slope, same intercept, and different amounts of scatter around the best-fit line.|
The coefficient of determination, or r2, expresses the strength of the relationship between the X and Y variables. It is the proportion of the variation in the Y variable that is "explained" by the variation in the X variable. r2 can vary from 0 to 1; values near 1 mean the Y values fall almost right on the regression line, while values near 0 mean there is very little relationship between X and Y. As you can see, regressions can have a small r2 and not look like there's any relationship, yet they still might have a slope that's significantly different from zero.
To illustrate the meaning of r2, here are six pairs of X and Y values:
            deviate     squared
  X    Y    from mean   deviate
 ---  ---   ---------   -------
  1     2       8          64
  3     9       1           1
  5     9       1           1
  6    11       1           1
  7    14       4          16
  9    15       5          25
                          ----
       sum of squares:     108
If you didn't know anything about the X value and were told to guess what a Y value was, your best guess would be the mean Y; for this example, the mean Y is 10. The squared deviates of the Y values from their mean is the total sum of squares, familiar from analysis of variance. The vertical lines on the left graph below show the deviates from the mean; the first point has a deviate of 8, so its squared deviate is 64, etc. The total sum of squares for these numbers is 64+1+1+1+16+25=108.
|Deviations from the mean Y and from the regression line.|
If you did know the X value and were told to guess what a Y value was, you'd calculate the regression equation and use it. The regression equation for these numbers is Y=1.5429×X+2.0286, so for the first X value you'd predict a Y value of 1.5429×1+2.0286=3.5715, etc. The vertical lines on the right graph above show the deviates of the actual Y values from the predicted Y values. As you can see, most of the points are closer to the regression line than they are to the overall mean. Squaring these deviates and taking the sum gives us the regression sum of squares, which for these numbers is 10.8. (Be aware that many textbooks call this quantity the residual or error sum of squares, and use "regression sum of squares" for the difference between it and the total sum of squares.)
            predicted      deviate        squared
  X    Y     Y-value    from predicted    deviate
 ---  ---   ---------   --------------    -------
  1     2      3.57          1.57           2.46
  3     9      6.66          2.34           5.48
  5     9      9.74          0.74           0.55
  6    11     11.29          0.29           0.08
  7    14     12.83          1.17           1.37
  9    15     15.91          0.91           0.83
                                           ------
          regression sum of squares:        10.8
The regression sum of squares is 10.8, which is 90% smaller than the total sum of squares (108). This difference between the two sums of squares, expressed as a fraction of the total sum of squares, is the r2. In this case we would say that r2=0.90; the X-variable "explains" 90% of the variation in the Y-variable.
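The arithmetic in this worked example can be reproduced with a few lines of Python. This is only a sketch; the variable names are mine, and `ss_around_line` is the quantity this page calls the regression sum of squares.

```python
# The six (X, Y) pairs from the worked example.
xs = [1, 3, 5, 6, 7, 9]
ys = [2, 9, 9, 11, 14, 15]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Squared deviates of Y from the mean Y (total sum of squares) ...
total_ss = sum((y - mean_y) ** 2 for y in ys)
# ... and from the Y values predicted by the regression line.
predicted = [intercept + slope * x for x in xs]
ss_around_line = sum((y - p) ** 2 for y, p in zip(ys, predicted))

# r2 is the difference between the two sums of squares,
# expressed as a fraction of the total sum of squares.
r2 = (total_ss - ss_around_line) / total_ss

print(f"total sum of squares:                   {total_ss:.1f}")
print(f"sum of squares around regression line:  {ss_around_line:.1f}")
print(f"r2:                                     {r2:.2f}")
```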
The r2 value is formally known as the "coefficient of determination," although it is usually just called r2. The square root of r2, with a negative sign if the slope is negative, is the Pearson product-moment correlation coefficient, or just "correlation coefficient." Either r or r2 can be used to describe the strength of the association between two variables, but I recommend r2, because it has a more understandable meaning (the proportional difference between total sum of squares and regression sum of squares) and doesn't have those annoying negative values.
The test statistic for a linear regression is ts=√(d.f.×r2/(1−r2)). It gets larger as the degrees of freedom (n−2) get larger or the r2 gets larger. Under the null hypothesis, ts is t-distributed with n−2 degrees of freedom. When reporting the results of a linear regression, it is conventional to report just the r2 and degrees of freedom, not the ts value. Anyone who really needs the ts value can calculate it from the r2 and degrees of freedom.
It is also possible to square ts and get an F-statistic with 1 degree of freedom in the numerator and n−2 degrees of freedom in the denominator. The resulting P-value is mathematically identical to that calculated with ts.
A common observation in ecology is that species diversity decreases as you get further from the equator. To see whether this pattern could be seen on a small scale, I used data from the Audubon Society's Christmas Bird Count, in which birders try to count all the birds in a 15-mile diameter area during one winter day. I looked at the total number of species seen in each area on the Delmarva Peninsula during the 2005 count. Latitude and number of bird species are the two measurement variables; location is the hidden nominal variable.
Location                    Latitude   Number of species
Bombay Hook, DE              39.217          128
Cape Henlopen, DE            38.800          137
Middletown, DE               39.467          108
Milford, DE                  38.958          118
Rehoboth, DE                 38.600          135
Seaford-Nanticoke, DE        38.583           94
Wilmington, DE               39.733          113
Crisfield, MD                38.033          118
Denton, MD                   38.900           96
Elkton, MD                   39.533           98
Lower Kent County, MD        39.133          121
Ocean City, MD               38.317          152
Salisbury, MD                38.333          108
S. Dorchester County, MD     38.367          118
Cape Charles, VA             37.200          157
Chincoteague, VA             37.967          125
Wachapreague, VA             37.667          114
|Latitude and bird species on the Delmarva Peninsula.|
The result is r2=0.214, with 15 d.f., so the P-value is 0.061. The trend is in the expected direction, but it is not quite significant. The equation of the regression line is number of species=−12.039(latitude)+585.14. Even if it were significant, I don't know what you'd do with the equation; I suppose you could extrapolate and use it to predict that above the 49th parallel, there would be fewer than zero bird species.
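For readers who would like to check these numbers without a spreadsheet or SAS, here is a minimal Python sketch using only the standard library. It fits the line to the bird data, takes the test statistic as the square root of d.f.×r2/(1−r2), and gets a two-tailed P-value by numerically integrating the t-distribution with Simpson's rule. That integration is a shortcut I chose for illustration, not how statistical packages do it, and all function names are hypothetical.

```python
import math

# Latitude and number of species for the 17 Delmarva count locations.
latitude = [39.217, 38.800, 39.467, 38.958, 38.600, 38.583, 39.733, 38.033,
            38.900, 39.533, 39.133, 38.317, 38.333, 38.367, 37.200, 37.967,
            37.667]
species = [128, 137, 108, 118, 135, 94, 113, 118, 96, 98, 121, 152, 108,
           118, 157, 125, 114]

def linear_regression(x, y):
    """Return slope, intercept, r2 and degrees of freedom for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2, n - 2

def t_pdf(t, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed P-value: 1 minus twice the area under the density on [0, |t|],
    with the area found by Simpson's rule (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

slope, intercept, r2, df = linear_regression(latitude, species)
t_s = math.sqrt(df * r2 / (1 - r2))
p_value = two_tailed_p(t_s, df)

print(f"species = {slope:.3f}(latitude) + {intercept:.2f}")
print(f"r2 = {r2:.3f}, d.f. = {df}, ts = {t_s:.2f}, P = {p_value:.3f}")
```

If the data have been typed in correctly, the printed slope, intercept, r2, and P-value should agree with the values reported above.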
|Relationship of body temperature and inter-call interval in the gray tree frog.|
Gayou (1984) measured the intervals between male mating calls in the gray tree frog, Hyla versicolor, at different temperatures. The regression line is interval=−0.205(temperature)+8.36, and it is highly significant (r2=0.29, 45 d.f., P=9×10⁻⁵). You could rearrange the equation, temperature=(interval−8.36)/(−0.205), measure the interval between frog mating calls, and estimate the air temperature. Or you could buy a thermometer.
Goheen et al. (2003) captured 14 female northern grasshopper mice (Onychomys leucogaster) in north-central Kansas, measured the body length, and counted the number of offspring. There are two measurement variables, body length and number of offspring, and the authors were interested in whether larger body size causes an increase in the number of offspring, so they did a linear regression. The results are significant: r2=0.46, 12 d.f., P=0.008. The equation of the regression line is offspring=−7.88+0.108(length).
Graphing the results
In a spreadsheet, you show the results of a regression on a scatter graph, with the independent variable on the X axis. To add the regression line to the graph, finish making the graph, then select the graph and go to the Chart menu. Choose "Add Trendline" and choose the straight line. If you want to show the regression line extending beyond the observed range of X-values, choose "Options" and adjust the "Forecast" numbers until you get the line you want.
Sometimes it is not clear whether an experiment includes one measurement variable and two nominal variables, and should be analyzed with a two-way anova or paired t-test, or includes two measurement variables and one nominal variable, and should be analyzed with correlation and regression. In that case, your choice of test is determined by the biological question you're interested in. For example, let's say you've measured the range of motion of the right shoulder and left shoulder of a bunch of right-handed people. If your question is "Is there an association between the range of motion of people's right and left shoulders--do people with more flexible right shoulders also tend to have more flexible left shoulders?", you'd treat "right shoulder range-of-motion" and "left shoulder range-of-motion" as two different measurement variables, and individual as one nominal variable, and analyze with correlation and regression. If your question is "Is the right shoulder more flexible than the left shoulder?", you'd treat "range of motion" as one measurement variable, "right vs. left" as one nominal variable, individual as one nominal variable, and you'd analyze with two-way anova or a paired t-test.
If the dependent variable is a percentage, such as percentage of people who have heart attacks on different doses of a drug, it's really a nominal variable, not a measurement. Each individual observation is a value of the nominal variable ("heart attack" or "no heart attack"); the percentage is not really a single observation, it's a way of summarizing a bunch of observations. One approach for percentage data is to arcsine transform the percentages and analyze with correlation and linear regression. You'll see this in the literature, and it's not horrible, but it's better to analyze using logistic regression.
If the relationship between the two measurement variables is best described by a curved line, not a straight one, one possibility is to try different transformations on one or both of the variables. The other option is to use polynomial regression (also known as curvilinear regression).
Linear regression assumes that the Y values for any value of X would be normally distributed and homoscedastic; if these assumptions are violated, Spearman rank correlation, the non-parametric analog of linear regression, may be used.
How to do the test
I have put together a spreadsheet to do linear regression on up to 1000 pairs of observations. It provides the following:
- The regression coefficient (the slope of the regression line).
- The Y-intercept. With the slope and the intercept, you have the equation for the regression line: Y=a+bX, where a is the Y intercept and b is the slope.
- The r2 value.
- The degrees of freedom. There are n−2 degrees of freedom in a regression, where n is the number of observations.
- The P-value. This gives you the probability of finding a slope that is as large as or larger than the observed slope, under the null hypothesis that the true slope is 0.
- A Y-estimator and an X-estimator. This enables you to enter a value of X and find the corresponding value of Y on the best-fit line, or vice-versa. This would be useful for constructing standard curves, such as used in protein assays for example.
Web pages that will perform linear regression are here, here, and here. They all require you to enter each number individually, and thus are inconvenient for large data sets. This web page does linear regression and lets you paste in a set of numbers, which is more convenient for large data sets.
You can use either PROC GLM or PROC REG for a simple linear regression; since PROC REG is also used for multiple regression, you might as well learn to use it. Here's an example using the bird data from above.
data birds;
   input town $ state $ latitude species;
   cards;
Bombay_Hook DE 39.217 128
Cape_Henlopen DE 38.800 137
Middletown DE 39.467 108
Milford DE 38.958 118
Rehoboth DE 38.600 135
Seaford-Nanticoke DE 38.583 94
Wilmington DE 39.733 113
Crisfield MD 38.033 118
Denton MD 38.900 96
Elkton MD 39.533 98
Lower_Kent_County MD 39.133 121
Ocean_City MD 38.317 152
Salisbury MD 38.333 108
S_Dorchester_County MD 38.367 118
Cape_Charles VA 37.200 157
Chincoteague VA 37.967 125
Wachapreague VA 37.667 114
;
proc reg data=birds;
   model species=latitude;
   run;
The output includes an analysis of variance table. Don't be alarmed by this; if you dig down into the math, regression is just another variety of anova. Below the anova table are the r2, slope, intercept, and P-value:
Root MSE              16.37357    R-Square     0.2143    <- r2
Dependent Mean       120.00000    Adj R-Sq     0.1619
Coeff Var             13.64464

                        Parameter Estimates

                   Parameter      Standard
Variable     DF     Estimate         Error    t Value    Pr > |t|
Intercept     1    585.14462     230.02416       2.54      0.0225    <- intercept
latitude      1    -12.03922       5.95277      -2.02      0.0613    <- slope and P-value
These results indicate an r2 of 0.21, intercept of 585.1, a slope of −12.04, and a P-value of 0.061.
The G*Power program will calculate the sample size needed for a regression/correlation. The effect size is the absolute value of the correlation coefficient r; if you have r2, take the positive square root of it. Choose "t tests" from the "Test family" menu and "Correlation: Point biserial model" from the "Statistical test" menu. Enter the r-value you hope to see, your alpha (usually 0.05) and your power (usually 0.80 or 0.90).
For example, let's say you want to look for a relationship between calling rate and temperature in the barking tree frog, Hyla gratiosa. Gayou (1984) found an r2 of 0.29 in H. versicolor, so you decide you want to be able to detect an r2 of 0.25 or more. The square root of 0.25 is 0.5, so you enter 0.5 for "Effect size", 0.05 for alpha, and 0.8 for power. The result is 26 observations of temperature and frog calling rate.
It's important to note that the distribution of X variables, in this case air temperatures, should be the same for the proposed study as for the pilot study the sample size calculation was based on. Gayou (1984) measured frog calling rate at temperatures that were fairly evenly distributed from 10°C to 34°C. If you looked at a narrower range of temperatures, you'd need a lot more observations to detect the same kind of relationship.
Sokal and Rohlf, pp. 451-471, 486-493.
Zar, pp. 324-358, 377-386.
Gayou, D.C. 1984. Effects of temperature on the mating call of Hyla versicolor. Copeia 1984: 733-738.
Goheen, J.R., G.A. Kaufman, and D.W. Kaufman. 2003. Effect of body size on reproductive characteristics of the northern grasshopper mouse in north-central Kansas. Southwest. Naturalist 48: 427-431.
This page was last revised September 14, 2009. Its address is http://udel.edu/~mcdonald/statregression.html. It may be cited as pp. 207-220 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.
©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.