# Multiple regression

### When to use it

You use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent (Y) variable. The rest of the variables are the independent (X) variables. The purpose of a multiple regression is to find an equation that best predicts the Y variable as a linear function of the X variables. There is also a "hidden" nominal variable that groups the measurement variables together.

#### Multiple regression for prediction

One use of multiple regression is prediction or estimation of an unknown Y value corresponding to a set of X values. For example, let's say you're interested in finding suitable habitat to reintroduce the rare beach tiger beetle, *Cicindela dorsalis dorsalis*, which lives on sandy beaches on the Atlantic coast of North America. You've gone to a number of beaches that already have the beetles and measured the density of tiger beetles (the dependent variable) and several biotic and abiotic factors, such as wave exposure, sand particle size, beach steepness, density of amphipods and other prey organisms, etc. Multiple regression would give you an equation that would relate the tiger beetle density to a function of all the other variables. Then if you went to a beach that doesn't have tiger beetles and measured all the independent variables (wave exposure, sand particle size, etc.) you could use the multiple regression equation to predict the density of tiger beetles that could live there if you introduced them.

#### Multiple regression for understanding causes

A second use of multiple regression is to try to understand the functional relationships between the dependent and independent variables, to try to see what might be causing the variation in the dependent variable. For example, if you did a regression of tiger beetle density on sand particle size by itself, you would probably see a significant relationship. If you did a regression of tiger beetle density on wave exposure by itself, you would probably see a significant relationship. However, sand particle size and wave exposure are correlated; beaches with bigger waves tend to have bigger sand particles. Maybe sand particle size is really important, and the correlation between it and wave exposure is the only reason for a significant regression between wave exposure and beetle density. Multiple regression is a statistical way to try to control for this; it can answer questions like "If sand particle size (and every other measured variable) were the same, would the regression of beetle density on wave exposure be significant?"

### Null hypothesis

The main null hypothesis of a multiple regression is that there is no relationship between the X variables and the Y variable, that the fit of the observed Y values to those predicted by the multiple regression equation is no better than what you would expect by chance. There is also a null hypothesis for each X variable: that adding that X variable to the multiple regression does not improve the fit of the equation any more than expected by chance.

### How it works

The basic idea is that an equation is found, like this:

Y_{exp} = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + ...

The Y_{exp} is the expected value of Y for a given set of X values. b_{1} is the estimated slope of a regression of Y on X_{1}, if all of the other X variables could be kept constant, and so on for b_{2}, b_{3}, etc.; *a* is the intercept. The values of b_{1}, b_{2}, etc. (the "partial regression coefficients") and the intercept are the ones that minimize the squared deviations between the expected and observed values of Y.
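In practice a statistics package finds these coefficients for you, but a minimal sketch (using numpy's least-squares solver on made-up, noise-free data, not the tiger beetle or dace data) shows that minimizing the squared deviations recovers the intercept and partial regression coefficients:

```python
import numpy as np

# Hypothetical data: Y is built as exactly 2 + 0.5*X1 - 1.5*X2, with no noise,
# so least squares should recover those coefficients essentially exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 observations, 2 independent variables
y = 2.0 + 0.5 * X[:, 0] - 1.5 * X[:, 1]

# Design matrix with a leading column of ones for the intercept a
A = np.column_stack([np.ones(len(y)), X])

# Minimize the squared deviations between observed and expected Y
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coefs
print(a, b1, b2)
```

With real, noisy data the recovered coefficients would of course only approximate the underlying relationship.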

How well the equation fits the data is expressed by R^{2}, the "coefficient of multiple determination." This can range from 0 (for no relationship between the X and Y variables) to 1 (for a perfect fit, no difference between the observed and expected Y values). The P-value is a function of the R^{2}, the number of observations, and the number of X variables.

When the purpose of multiple regression is prediction, the important result is an equation containing partial regression coefficients. If you had the partial regression coefficients and measured the X variables, you could plug them into the equation and predict the corresponding value of Y. The magnitude of the partial regression coefficient depends on the unit used for each variable, so it does not tell you anything about the relative importance of each variable.

When the purpose of multiple regression is understanding functional relationships, the important result is an equation containing *standard* partial regression coefficients, like this:

Y'_{exp} = a + b'_{1}X'_{1} + b'_{2}X'_{2} + b'_{3}X'_{3} + ...

where b'_{1} is the standard partial regression coefficient of Y on X_{1}. It is the number of standard deviations that Y would change for every one standard deviation change in X_{1}, if all the other X variables could be kept constant. The magnitude of the standard partial regression coefficients tells you something about the relative importance of different variables; X variables with bigger standard partial regression coefficients have a stronger relationship with the Y variable.
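As a sketch (again with made-up data), each standard partial regression coefficient is just the ordinary slope rescaled by sd(X_i)/sd(Y); equivalently, you get the same numbers by rerunning the regression on z-scored variables:

```python
import numpy as np

# Made-up data: Y depends more strongly on X1 than on X2
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Ordinary partial regression coefficients (slopes only, intercept dropped)
A = np.column_stack([np.ones(50), X])
b = np.linalg.lstsq(A, y, rcond=None)[0][1:]

# Standard partial regression coefficients: slopes in standard-deviation units
b_std = b * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Equivalent route: regress z-scored Y on z-scored Xs
zX = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
Az = np.column_stack([np.ones(50), zX])
b_std_direct = np.linalg.lstsq(Az, zy, rcond=None)[0][1:]
```

Because the data were built with a bigger coefficient on X1, its standard partial regression coefficient comes out larger, matching the "relative importance" interpretation above.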

### Selecting variables in multiple regression

Every time you add a variable to a multiple regression, the R^{2} increases (unless the variable is a simple linear function of one of the other variables, in which case R^{2} will stay the same). The best-fitting model is therefore the one that includes all of the X variables. However, whether the purpose of a multiple regression is prediction or understanding functional relationships, it is often useful to decide which are important and unimportant variables. In the tiger beetle example, if your purpose was prediction it would be useful to know that your prediction would be almost as good if you measured only sand particle size and amphipod density, rather than measuring a dozen difficult variables. If your purpose was understanding possible causes, knowing that certain variables did not explain much of the variation in tiger beetle density could suggest that they are probably not important causes of the variation in beetle density.

One way to choose variables, called forward selection, is to do a linear regression for each of the X variables, one at a time, then pick the X variable that had the highest R^{2}. Next you do a multiple regression with the X variable from step 1 and each of the other X variables. The X variable that increases the R^{2} by the greatest amount is added, if the P-value of the increase in R^{2} is below the desired cutoff. This procedure continues until adding another X variable does not significantly increase the R^{2}.

To calculate the P-value of an increase in R^{2} when increasing the number of X variables from *d* to *e*, where the total sample size is *n*, use the formula:

F_{s} = [(R^{2}_{e} − R^{2}_{d}) / (e − d)] / [(1 − R^{2}_{e}) / (n − e − 1)]
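As an arithmetic check, here is the formula applied to step 2 of the longnose dace example below (adding no3 to a model that already contains acreage, so d = 1 and e = 2). Taking n = 68 for the full data set is my assumption, since only part of the data is reproduced here; SAS reports F = 10.20 for this step.

```python
# F statistic for the increase in R^2 when going from d to e independent
# variables, following the formula above. R^2 values are from the stepwise
# summary in the dace example; n = 68 is an assumption about the full data set.
def f_increase(r2_d, r2_e, d, e, n):
    return ((r2_e - r2_d) / (e - d)) / ((1 - r2_e) / (n - e - 1))

f_s = f_increase(r2_d=0.1201, r2_e=0.2394, d=1, e=2, n=68)
print(round(f_s, 2))  # close to the F value of 10.20 reported by SAS
```

You would then look this F value up (with e−d and n−e−1 degrees of freedom) to get the P-value of the increase in R^{2}.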

A second technique, called backward elimination, is to start with a multiple regression using all of the X variables, then perform multiple regressions with each X variable removed in turn. The X variable whose removal causes the smallest decrease in R^{2} is eliminated. This process continues until removal of any X variable would cause a significant decrease in R^{2}.

Odd things can happen when using either of the above techniques. You could add variables X_{1}, X_{2}, X_{3}, and X_{4}, with a significant increase in R^{2} at each step, then find that once you've added X_{3} and X_{4}, you can remove X_{1} with little decrease in R^{2}. It is possible to do multiple regression with independent variables A, B, C, and D, and have forward selection choose variables A and B, and backward elimination choose variables C and D. To avoid this, many people use stepwise multiple regression. After adding each X variable, the effects of removing any of the other X variables are tested. This continues until adding new X variables does not significantly increase R^{2} and removing X variables does not significantly decrease it.
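A minimal sketch of forward selection may make the procedure concrete. The data are made up (Y really depends only on two of four X variables), and for simplicity the sketch uses a fixed F cutoff rather than the P-to-enter cutoff a real statistics package would use:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of the least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coefs = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ coefs
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def forward_select(X, y, f_cutoff=8.0):
    """Greedy forward selection: repeatedly add the variable that raises R^2
    the most, as long as the F statistic for the increase (one variable added
    at a time, so e - d = 1) exceeds f_cutoff."""
    n, p = X.shape
    chosen, r2 = [], 0.0
    while len(chosen) < p:
        gains = {j: r_squared(X[:, chosen + [j]], y)
                 for j in range(p) if j not in chosen}
        j_best = max(gains, key=gains.get)
        r2_new = gains[j_best]
        e = len(chosen) + 1
        f_s = (r2_new - r2) / ((1 - r2_new) / (n - e - 1))
        if f_s < f_cutoff:
            break
        chosen.append(j_best)
        r2 = r2_new
    return chosen

# Made-up data: y really depends only on columns 0 and 2 of X
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 5.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.5, size=100)
print(forward_select(X, y))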

#### Important warning

It is easy to throw a big data set at a multiple regression and get an impressive-looking output. However, many people are skeptical of the usefulness of multiple regression, especially for variable selection, and you should view the results with caution. You should examine the linear regression of the dependent variable on each independent variable, one at a time, examine the linear regressions between each pair of independent variables, and consider what you know about the biology. You should probably treat multiple regression as a way of suggesting patterns in your data, rather than rigorous hypothesis testing.

If independent variables A and B are both correlated with Y, and A and B are highly correlated with each other, only one may contribute significantly to the model, but it would be incorrect to blindly conclude that the variable that was dropped from the model has no biological importance. For example, let's say you did a multiple regression on vertical leap in children five to 12 years old, with height, weight, age and score on a reading test as independent variables. All four independent variables are highly correlated in children, since older children are taller, heavier and read better, so it's possible that once you've added weight and age to the model, there is so little variation left that the effect of height is not significant. It would be biologically silly to conclude that height had no influence on vertical leap. Because reading ability is correlated with age, it's possible that it would contribute significantly to the model; that might suggest some interesting followup experiments on children all of the same age, but it would be unwise to conclude that there was a real effect of reading ability on vertical leap based solely on the multiple regression.

### Example

I extracted some data from the Maryland Biological Stream Survey to practice multiple regression on; the data are shown below in the SAS example. The dependent variable is the number of longnose dace (*Rhinichthys cataractae*) per 75-meter section of stream. The independent variables are the area (in acres) drained by the stream; the dissolved oxygen (in mg/liter); the maximum depth (in cm) of the 75-meter segment of stream; nitrate concentration (mg/liter); sulfate concentration (mg/liter); and the water temperature on the sampling date (in degrees C).

One biological goal might be to measure the physical and chemical characteristics of a stream and be able to predict the abundance of longnose dace; another goal might be to generate hypotheses about the causes of variation in longnose dace abundance.

The result of a stepwise multiple regression, with P-to-enter and P-to-leave both set to 0.15, is that acreage, nitrate, and maximum depth contribute to the multiple regression equation. The R^{2} of the model including these three terms is 0.28, which isn't very high.

### Graphing the results

If the multiple regression equation ends up with only two independent variables, you might be able to draw a three-dimensional graph of the relationship. Because most humans have a hard time visualizing four or more dimensions, there's no good visual way to summarize all the information in a multiple regression with three or more independent variables. It could be useful to plot a scattergraph with the predicted values on the X-axis and the observed values on the Y-axis. For the longnose dace, I set up a spreadsheet with acreage in column C, maximum depth in column E, and nitrate in column F. Then I put the following equation (in Excel format) in column J, row 2, and copied it into each cell in column J:

=0.00199*C2+0.3361*E2+8.67304*F2-23.82907
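The same calculation can be sketched in Python; the coefficients are those of the fitted equation, and the example values are those of the first stream in the data set (BASIN_RUN: acreage 2528, maximum depth 80 cm, nitrate 2.28 mg/liter), whose observed abundance was 13 dace:

```python
# Python equivalent of the Excel formula above: predicted longnose dace
# abundance from the three-variable model.
def predicted_dace(acreage, maxdepth, no3):
    return 0.00199 * acreage + 0.3361 * maxdepth + 8.67304 * no3 - 23.82907

# First stream in the data set (BASIN_RUN); observed abundance was 13
print(round(predicted_dace(2528, 80, 2.28), 1))  # predicts about 27.9 fish
```

The gap between 13 observed and roughly 28 predicted for this stream is typical of how loosely the model fits, as the graph below shows.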

*Observed abundance of longnose dace vs. the abundance predicted from the multiple regression equation.*

If the multiple regression were perfect, the points would fall on the diagonal dashed line; I made the graph square, with the same scale on the X and Y axis, to emphasize this. The graph makes it easy to see that the multiple regression equation doesn't do a very good job of predicting longnose dace abundance; either other factors that haven't been included in the model are important, or there's a lot of stochasticity in longnose dace abundance.

### Similar tests

There are dozens of other multivariate statistical techniques that have been developed, and picking the most appropriate one for your experiment, and interpreting the results, can be difficult. My goal here is mainly to help you understand the results of the most common technique, multiple regression; if you want to actually use multivariate techniques, you're going to have to do a lot of reading in more specialized texts and consult with experts.

### How to do multiple regression

#### Spreadsheet

If you're serious about doing multiple regressions as part of your research, you're going to have to learn a specialized statistical program such as SAS or SPSS. I've written a spreadsheet that will enable you to do a multiple regression with up to 12 X variables and up to 1000 observations. It's fun to play with, but I'm not confident enough in it that I'd recommend using it for publishable results. The spreadsheet includes histograms to help you decide whether to transform your variables, and scattergraphs of the Y variable vs. each X variable so you can see if there are any non-linear relationships. It doesn't do variable selection automatically; you manually choose which variables to include.

**The spreadsheet had an error that would have given incorrect results for data sets with more than 100 observations. I have corrected the spreadsheet on March 27, 2013. Thanks to Linus Johnson for pointing out the error.**

#### Web pages

VassarStat, Rweb and AutoFit are three web pages that are supposed to perform multiple regression, but I haven't been able to get them to work on my computer.

#### SAS

You use PROC REG to do multiple regression in SAS. Here is an example using the data on longnose dace abundance described above.

```
data fish;
  input stream $ longnosedace acreage do2 maxdepth no3 so4 temp;
  cards;
BASIN_RUN 13 2528 9.6 80 2.28 16.75 15.3
====See the web page for the full data set====
WATTS_BR 19 510 6.7 82 5.25 14.19 26.5
;
proc reg data=fish;
  model longnosedace=acreage do2 maxdepth no3 so4 temp /
    selection=stepwise slentry=0.15 slstay=0.15 details=summary stb;
run;
```

In the MODEL statement, the dependent variable is to the left of the equals sign, and all the independent variables are to the right. SELECTION determines which variable selection method is used; choices include FORWARD, BACKWARD, STEPWISE, and several others. You can omit the SELECTION parameter if you want to see the multiple regression model that includes all the independent variables. SLENTRY is the significance level for entering a variable into the model, if you're using FORWARD or STEPWISE selection; in this example, a variable must have a P-value less than 0.15 to be entered into the regression model. SLSTAY is the significance level for removing a variable in BACKWARD or STEPWISE selection; in this example, a variable with a P-value greater than 0.15 will be removed from the model. DETAILS=SUMMARY produces a shorter output file; you can omit it to see more details on each step of the variable selection process. The STB option causes the standard partial regression coefficients to be displayed.

```
                     Summary of Stepwise Selection

       Variable  Variable  Number   Partial   Model
Step   Entered   Removed   Vars In  R-Square  R-Square   C(p)     F Value  Pr > F
  1    acreage                 1     0.1201    0.1201   14.2427     9.01   0.0038
  2    no3                     2     0.1193    0.2394    5.6324    10.20   0.0022
  3    maxdepth                3     0.0404    0.2798    4.0370     3.59   0.0625
```

The summary shows that acreage was added to the model first, yielding an R^{2} of 0.1201. Next, no3 was added. The R^{2} increased to 0.2394, and the increase in R^{2} was significant (P=0.0022). Next, maxdepth was added. The R^{2} increased to 0.2798, which was not quite significant (P=0.0625); SLSTAY was set to 0.15, not 0.05, because you might want to include this variable in a predictive model even if it's not quite significant. None of the other variables increased R^{2} enough to have a P-value less than 0.15, and removing any of the variables caused a decrease in R^{2} big enough that P was less than 0.15, so the stepwise process is done.

```
                         Parameter Estimates

                  Parameter    Standard                          Standardized
Variable    DF    Estimate     Error         t Value  Pr > |t|   Estimate
Intercept    1   -23.82907    15.27399        -1.56    0.1237     0
acreage      1     0.00199     0.00067421      2.95    0.0045     0.32581
maxdepth     1     0.33661     0.17757         1.90    0.0625     0.20860
no3          1     8.67304     2.77331         3.13    0.0027     0.33409
```

The "parameter estimates" are the partial regression coefficients; they show that the model is Y_{exp}=0.00199(acreage)+0.33661(maxdepth)+8.67304(no3)−23.82907. The "standardized estimates" are the standard partial regression coefficients; they show that no3 has the greatest contribution to the model, followed by acreage and then maxdepth. The value of this multiple regression is that it suggests that the acreage of a stream's watershed is somehow important. Because watershed area wouldn't have any direct effect on the fish in the stream, I would carefully look at the correlations between the acreage and the other independent variables; I would also try to see if there are other variables that were not analyzed that might be both correlated with watershed area and directly important to fish, such as current speed, water clarity, or substrate type.

### Further reading

Sokal and Rohlf, pp. 609-631.

Zar, pp. 413-450.



This page was last revised March 27, 2013. Its address is http://udel.edu/~mcdonald/statmultreg.html. It may be cited as pp. 239-246 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.

©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.