When to use it
You use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent (Y) variable. The rest of the variables are the independent (X) variables. The purpose of a multiple regression is to find an equation that best predicts the Y variable as a linear function of the X variables. There is also a "hidden" nominal variable that groups the measurement variables together.
Multiple regression for prediction
One use of multiple regression is prediction or estimation of an unknown Y value corresponding to a set of X values. For example, let's say you're interested in finding suitable habitat to reintroduce the rare beach tiger beetle, Cicindela dorsalis dorsalis, which lives on sandy beaches on the Atlantic coast of North America. You've gone to a number of beaches that already have the beetles and measured the density of tiger beetles (the dependent variable) and several biotic and abiotic factors, such as wave exposure, sand particle size, beach steepness, density of amphipods and other prey organisms, etc. Multiple regression would give you an equation that would relate the tiger beetle density to a function of all the other variables. Then if you went to a beach that doesn't have tiger beetles and measured all the independent variables (wave exposure, sand particle size, etc.) you could use the multiple regression equation to predict the density of tiger beetles that could live there if you introduced them.
Multiple regression for understanding causes
A second use of multiple regression is to try to understand the functional relationships between the dependent and independent variables, to try to see what might be causing the variation in the dependent variable. For example, if you did a regression of tiger beetle density on sand particle size by itself, you would probably see a significant relationship. If you did a regression of tiger beetle density on wave exposure by itself, you would probably see a significant relationship. However, sand particle size and wave exposure are correlated; beaches with bigger waves tend to have bigger sand particles. Maybe sand particle size is really important, and the correlation between it and wave exposure is the only reason for a significant regression between wave exposure and beetle density. Multiple regression is a statistical way to try to control for this; it can answer questions like "If sand particle size (and every other measured variable) were the same, would the regression of beetle density on wave exposure be significant?"
The main null hypothesis of a multiple regression is that there is no relationship between the X variables and the Y variable; in other words, that the fit of the observed Y values to those predicted by the multiple regression equation is no better than what you would expect by chance. As you are doing a multiple regression, there is also a null hypothesis for each X variable: that adding that X variable to the multiple regression does not improve the fit of the multiple regression equation any more than expected by chance.
How it works
The basic idea is that an equation is found, like this:
$$Y_{exp} = a + b_1X_1 + b_2X_2 + b_3X_3 + \cdots$$
The Yexp is the expected value of Y for a given set of X values. b1 is the estimated slope of a regression of Y on X1, if all of the other X variables could be kept constant, and so on for b2, b3, etc.; a is the intercept. Values of b1, etc. (the "partial regression coefficients") and the intercept are found that minimize the sum of the squared deviations between the expected and observed values of Y.
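As a rough illustration of how the intercept and partial regression coefficients can be found, here is a minimal Python sketch that solves the least-squares "normal equations" by Gauss-Jordan elimination. The function name `fit_multiple_regression` and the data are made up for this example; real statistical software uses more numerically careful methods.

```python
def fit_multiple_regression(xs, ys):
    """Return [a, b1, b2, ...] minimizing the sum of squared deviations.

    xs is a list of rows, one per observation, each holding the X values;
    ys is the list of observed Y values.
    """
    # Design matrix: a leading 1 in each row lets the intercept a be estimated.
    rows = [[1.0] + [float(v) for v in row] for row in xs]
    p = len(rows[0])
    # Normal equations (X'X) beta = X'Y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X variables are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Made-up data generated exactly from Y = 1 + 2*X1 + 3*X2 (no noise).
xs = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
a, b1, b2 = fit_multiple_regression(xs, ys)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

Because the made-up Y values contain no noise, the fit recovers the intercept and slopes used to generate them, up to rounding error.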
How well the equation fits the data is expressed by R2, the "coefficient of multiple determination." This can range from 0 (for no relationship between the X and Y variables) to 1 (for a perfect fit, no difference between the observed and expected Y values). The P-value is a function of the R2, the number of observations, and the number of X variables.
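The two extremes of R2 can be checked with a short Python sketch. It assumes the usual definition of R2 (one minus the ratio of the residual sum of squares to the total sum of squares) and the usual F statistic for the whole model; neither formula is spelled out in the text above, and the function names and numbers are made up for illustration.

```python
def r_squared(observed, expected):
    """Coefficient of multiple determination from observed and expected Y."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - e) ** 2 for y, e in zip(observed, expected))
    return 1.0 - ss_residual / ss_total


def overall_f(r2, n, k):
    """F statistic for the whole model: n observations, k X variables.

    It is compared to an F distribution with k and n - k - 1 degrees of freedom.
    """
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


observed = [2.0, 4.0, 6.0, 8.0]
perfect = [2.0, 4.0, 6.0, 8.0]       # expected values match exactly
useless = [5.0, 5.0, 5.0, 5.0]       # every expected value is just the mean
print(r_squared(observed, perfect))  # 1.0
print(r_squared(observed, useless))  # 0.0
print(overall_f(0.5, n=12, k=2))     # F depends only on R2, n, and k
```

The last line shows why the P-value is a function of only R2, the number of observations, and the number of X variables: those are the only inputs to the F statistic.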
When the purpose of multiple regression is prediction, the important result is an equation containing partial regression coefficients. If you had the partial regression coefficients and measured the X variables, you could plug them into the equation and predict the corresponding value of Y. The magnitude of the partial regression coefficient depends on the unit used for each variable, so it does not tell you anything about the relative importance of each variable.
When the purpose of multiple regression is understanding functional relationships, the important result is an equation containing standard partial regression coefficients, like this:
$$Y'_{exp} = a + b'_1X'_1 + b'_2X'_2 + b'_3X'_3 + \cdots$$
where the primes mark values measured in standard deviations, and b'1 is the standard partial regression coefficient of Y on X1. It is the number of standard deviations that Y would change for every one standard deviation change in X1, if all the other X variables could be kept constant. The magnitude of the standard partial regression coefficients tells you something about the relative importance of different variables; X variables with bigger standard partial regression coefficients have a stronger relationship with the Y variable.
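One way to see the difference between the two kinds of coefficient is the following Python sketch. It assumes the standard conversion, in which a partial regression coefficient is multiplied by the standard deviation of its X variable and divided by the standard deviation of Y; that formula is not given in the text above, and the function name and data are made up.

```python
from statistics import stdev


def standard_partial_coefficient(b, x_values, y_values):
    """Rescale a partial regression coefficient to standard-deviation units."""
    return b * stdev(x_values) / stdev(y_values)


# Made-up data: the same X variable recorded in meters and in centimeters.
length_m = [1.0, 2.0, 3.0, 4.0]
length_cm = [100 * v for v in length_m]
y = [3.0, 5.0, 4.0, 8.0]

b_per_m = 1.4              # least-squares slope of y on length in meters
b_per_cm = b_per_m / 100   # the same slope expressed per centimeter

std_m = standard_partial_coefficient(b_per_m, length_m, y)
std_cm = standard_partial_coefficient(b_per_cm, length_cm, y)
print(round(std_m, 4), round(std_cm, 4))
```

The raw coefficient changes a hundredfold when the unit changes, but the standardized value is the same either way, which is why only the standardized coefficients can be compared across variables.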
Selecting variables in multiple regression
Every time you add a variable to a multiple regression, the R2 increases (unless the variable is a simple linear function of one of the other variables, in which case R2 will stay the same). The best-fitting model is therefore the one that includes all of the X variables. However, whether the purpose of a multiple regression is prediction or understanding functional relationships, it is often useful to decide which are important and unimportant variables. In the tiger beetle example, if your purpose was prediction it would be useful to know that your prediction would be almost as good if you measured only sand particle size and amphipod density, rather than measuring a dozen difficult variables. If your purpose was understanding possible causes, knowing that certain variables did not explain much of the variation in tiger beetle density could suggest that they are probably not important causes of the variation in beetle density.
One way to choose variables, called forward selection, is to do a linear regression for each of the X variables, one at a time, then pick the X variable that had the highest R2. Next you do a multiple regression with the X variable from step 1 and each of the other X variables. The X variable that increases the R2 by the greatest amount is added, if the P-value of the increase in R2 is below the desired cutoff. This procedure continues until adding another X variable does not significantly increase the R2.
To calculate the P-value of an increase in R2 when increasing the number of X variables from d to e, where the total sample size is n, use the formula:
$$F_s = \frac{(R^2_e - R^2_d)/(e - d)}{(1 - R^2_e)/(n - e - 1)}$$
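The formula is easy to check against the SAS output for the longnose dace example later on this page. In this Python sketch, the function name is made up, and n = 68 is an assumption: the sample size is not stated here, and 68 is simply the value that makes the formula agree with the F values SAS reports.

```python
def f_for_r2_increase(r2_d, r2_e, d, e, n):
    """F statistic for the increase in R2 when going from d to e X variables."""
    return ((r2_e - r2_d) / (e - d)) / ((1.0 - r2_e) / (n - e - 1))


n = 68  # ASSUMPTION: inferred from the F values, not stated in the text

# (R2 before, R2 after, d, e) for the three steps of the stepwise summary.
steps = [
    (0.0, 0.1201, 0, 1),     # acreage added
    (0.1201, 0.2394, 1, 2),  # no3 added
    (0.2394, 0.2798, 2, 3),  # maxdepth added
]
f_values = [f_for_r2_increase(r2_d, r2_e, d, e, n) for r2_d, r2_e, d, e in steps]
for (_, _, _, e), f in zip(steps, f_values):
    print(e, round(f, 2))
```

Under that assumption the three results come out close to the F values of 9.01, 10.20, and 3.59 in the SAS summary; small differences are expected because the R2 values are rounded to four decimal places.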
A second technique, called backward elimination, is to start with a multiple regression using all of the X variables, then perform multiple regressions with each X variable removed in turn. The X variable whose removal causes the smallest decrease in R2 is eliminated. This process continues until removal of any X variable would cause a significant decrease in R2.
Odd things can happen when using either of the above techniques. You could add variables X1, X2, X3, and X4, with a significant increase in R2 at each step, then find that once you've added X3 and X4, you can remove X1 with little decrease in R2. It is possible to do multiple regression with independent variables A, B, C, and D, and have forward selection choose variables A and B, and backward elimination choose variables C and D. To avoid this, many people use stepwise multiple regression. After adding each X variable, the effect of removing each of the other X variables is tested. This continues until adding new X variables does not significantly increase R2 and removing X variables does not significantly decrease it.
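Forward selection, the simplest of the three procedures, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the data are made up, and because the Python standard library has no F distribution, the sketch stops on a hypothetical F cutoff (`f_to_enter`) rather than on a P-value as statistical packages do.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination.

    No check for singular systems; the example data below are well behaved.
    """
    p = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


def r_squared(columns, ys):
    """R2 of the least-squares regression of ys on the given X columns."""
    n = len(ys)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    coefs = solve(xtx, xty)
    expected = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - e) ** 2 for y, e in zip(ys, expected))
    return 1.0 - ss_resid / ss_total


def forward_selection(data, ys, f_to_enter=4.0):
    """data maps variable names to lists of X values; returns chosen names."""
    n = len(ys)
    chosen, r2_current = [], 0.0
    remaining = list(data)
    while remaining:
        # Try adding each remaining variable; keep the one with the highest R2.
        trials = {
            name: r_squared([data[v] for v in chosen + [name]], ys)
            for name in remaining
        }
        best = max(trials, key=trials.get)
        e = len(chosen) + 1
        # F for the increase in R2 when going from e - 1 to e variables.
        f = (trials[best] - r2_current) / ((1.0 - trials[best]) / (n - e - 1))
        if f < f_to_enter:
            break
        chosen.append(best)
        remaining.remove(best)
        r2_current = trials[best]
    return chosen


# Made-up data: Y depends strongly on A, weakly on C, and not at all on B.
data = {
    "A": [-1, -1, -1, -1, 1, 1, 1, 1],
    "B": [-1, -1, 1, 1, -1, -1, 1, 1],
    "C": [1, 1, -1, -1, -1, -1, 1, 1],
}
noise = [-0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5]
ys = [10 + 3 * a + c + e for a, c, e in zip(data["A"], data["C"], noise)]
chosen = forward_selection(data, ys)
print(chosen)  # ['A', 'C']
```

Variable A enters first because it gives the highest R2 on its own, C enters next, and B is left out because adding it does not increase R2. Backward elimination and stepwise selection would use the same F calculation, with removal steps added.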
It is easy to throw a big data set at a multiple regression and get an impressive-looking output. However, many people are skeptical of the usefulness of multiple regression, especially for variable selection, and you should view the results with caution. You should examine the linear regression of the dependent variable on each independent variable, one at a time, examine the linear regressions between each pair of independent variables, and consider what you know about the biology. You should probably treat multiple regression as a way of suggesting patterns in your data, rather than rigorous hypothesis testing.
If independent variables A and B are both correlated with Y, and A and B are highly correlated with each other, only one may contribute significantly to the model, but it would be incorrect to blindly conclude that the variable that was dropped from the model has no biological importance. For example, let's say you did a multiple regression on vertical leap in children five to 12 years old, with height, weight, age and score on a reading test as independent variables. All four independent variables are highly correlated in children, since older children are taller, heavier and read better, so it's possible that once you've added weight and age to the model, there is so little variation left that the effect of height is not significant. It would be biologically silly to conclude that height had no influence on vertical leap. Because reading ability is correlated with age, it's possible that it would contribute significantly to the model; that might suggest some interesting followup experiments on children all of the same age, but it would be unwise to conclude that there was a real effect of reading ability on vertical leap based solely on the multiple regression.
I extracted some data from the Maryland Biological Stream Survey to practice multiple regression on; the data are shown below in the SAS example. The dependent variable is the number of longnose dace (Rhinichthys cataractae) per 75-meter section of stream. The independent variables are the area (in acres) drained by the stream; the dissolved oxygen (in mg/liter); the maximum depth (in cm) of the 75-meter segment of stream; nitrate concentration (mg/liter); sulfate concentration (mg/liter); and the water temperature on the sampling date (in degrees C).
One biological goal might be to measure the physical and chemical characteristics of a stream and be able to predict the abundance of longnose dace; another goal might be to generate hypotheses about the causes of variation in longnose dace abundance.
The result of a stepwise multiple regression, with P-to-enter and P-to-leave both equal to 0.15, is that acreage, nitrate, and maximum depth contribute to the multiple regression equation. The R2 of the model including these three terms is 0.28, which isn't very high.
Graphing the results
If the multiple regression equation ends up with only two independent variables, you might be able to draw a three-dimensional graph of the relationship. Because most humans have a hard time visualizing four or more dimensions, there's no good visual way to summarize all the information in a multiple regression with three or more independent variables. It could be useful to plot a scattergraph with the predicted values on the X-axis and the observed values on the Y-axis. For the longnose dace, I set up a spreadsheet with acreage in column C, maximum depth in column E, and nitrate in column F. Then I put the following equation (in Excel format) in column J, row 2, and copied it into each cell in column J:
=0.00199*C2+0.33661*E2+8.67304*F2-23.82907
[Figure: Observed abundance of longnose dace vs. the abundance predicted from the multiple regression equation.]
If the multiple regression were perfect, the points would fall on the diagonal dashed line; I made the graph square, with the same scale on the X and Y axis, to emphasize this. The graph makes it easy to see that the multiple regression equation doesn't do a very good job of predicting longnose dace abundance; either other factors that haven't been included in the model are important, or there's a lot of stochasticity in longnose dace abundance.
There are dozens of other multivariate statistical techniques that have been developed, and picking the most appropriate one for your experiment, and interpreting the results, can be difficult. My goal here is mainly to help you understand the results of the most common technique, multiple regression; if you want to actually use multivariate techniques, you're going to have to do a lot of reading in more specialized texts and consult with experts.
How to do multiple regression
If you're serious about doing multiple regressions as part of your research, you're going to have to learn a specialized statistical program such as SAS or SPSS. I've written a spreadsheet that will enable you to do a multiple regression with up to 12 X variables and up to 1000 observations. It's fun to play with, but I'm not confident enough in it that I'd recommend using it for publishable results. The spreadsheet includes histograms to help you decide whether to transform your variables, and scattergraphs of the Y variable vs. each X variable so you can see if there are any non-linear relationships. It doesn't do variable selection automatically; you manually choose which variables to include.
The spreadsheet had an error that would have given incorrect results for data sets with more than 100 observations. I corrected the spreadsheet on March 27, 2013. Thanks to Linus Johnson for pointing out the error.
You use PROC REG to do multiple regression in SAS. Here is an example using the data on longnose dace abundance described above.
data fish;
   input stream $ longnosedace acreage do2 maxdepth no3 so4 temp;
   cards;
BASIN_RUN 13 2528 9.6 80 2.28 16.75 15.3
====See the web page for the full data set====
WATTS_BR 19 510 6.7 82 5.25 14.19 26.5
;
* Stepwise selection, with standardized coefficients requested by STB;
proc reg data=fish;
   model longnosedace=acreage do2 maxdepth no3 so4 temp /
         selection=stepwise slentry=0.15 slstay=0.15 details=summary stb;
   run;
In the MODEL statement, the dependent variable is to the left of the equals sign, and all the independent variables are to the right. SELECTION determines which variable selection method is used; choices include FORWARD, BACKWARD, STEPWISE, and several others. You can omit the SELECTION parameter if you want to see the multiple regression model that includes all the independent variables. SLENTRY is the significance level for entering a variable into the model, if you're using FORWARD or STEPWISE selection; in this example, a variable must have a P-value less than 0.15 to be entered into the regression model. SLSTAY is the significance level for removing a variable in BACKWARD or STEPWISE selection; in this example, a variable with a P-value greater than 0.15 will be removed from the model. DETAILS=SUMMARY produces a shorter output file; you can omit it to see more details on each step of the variable selection process. The STB option causes the standard partial regression coefficients to be displayed.
Summary of Stepwise Selection

      Variable  Variable  Number   Partial   Model
Step  Entered   Removed   Vars In  R-Square  R-Square  C(p)     F Value  Pr > F
1     acreage             1        0.1201    0.1201    14.2427  9.01     0.0038
2     no3                 2        0.1193    0.2394    5.6324   10.20    0.0022
3     maxdepth            3        0.0404    0.2798    4.0370   3.59     0.0625
The summary shows that acreage was added to the model first, yielding an R2 of 0.1201. Next, no3 was added. The R2 increased to 0.2394, and the increase in R2 was significant (P=0.0022). Next, maxdepth was added. The R2 increased to 0.2798, an increase that was not quite significant (P=0.0625); SLENTRY and SLSTAY were set to 0.15, not 0.05, because you might want to include this variable in a predictive model even if it's not quite significant. None of the other variables increased R2 enough to have a P-value less than 0.15, and removing any of the variables caused a decrease in R2 big enough that P was less than 0.15, so the stepwise process is done.
Parameter Estimates

               Parameter   Standard                         Standardized
Variable   DF  Estimate    Error       t Value  Pr > |t|    Estimate
Intercept  1   -23.82907   15.27399    -1.56    0.1237      0
acreage    1   0.00199     0.00067421  2.95     0.0045      0.32581
maxdepth   1   0.33661     0.17757     1.90     0.0625      0.20860
no3        1   8.67304     2.77331     3.13     0.0027      0.33409
The "parameter estimates" are the partial regression coefficients; they show that the model is Yexp=0.00199(acreage)+0.33661(maxdepth)+8.67304(no3)−23.82907. The "standardized estimates" are the standard partial regression coefficients; they show that no3 has the greatest contribution to the model, followed by acreage and then maxdepth. The value of this multiple regression would be that it suggests that the acreage of a stream's watershed is somehow important. Because watershed area wouldn't have any direct effect on the fish in the stream, I would carefully look at the correlations between the acreage and the other independent variables; I would also try to see if there are other variables that were not analyzed that might be both correlated with watershed area and directly important to fish, such as current speed, water clarity, or substrate type.
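To show the fitted equation being used for prediction, here is a small Python sketch that plugs the parameter estimates from the SAS output into the equation. The function and variable names are made up; the stream values are the BASIN_RUN line from the data listing above.

```python
# Partial regression coefficients and intercept from the SAS parameter estimates.
COEFFICIENTS = {"acreage": 0.00199, "maxdepth": 0.33661, "no3": 8.67304}
INTERCEPT = -23.82907


def predict_dace(stream):
    """Expected number of longnose dace per 75-meter section of stream."""
    return INTERCEPT + sum(b * stream[name] for name, b in COEFFICIENTS.items())


# BASIN_RUN: acreage 2528, maximum depth 80 cm, nitrate 2.28 mg/liter.
basin_run = {"acreage": 2528, "maxdepth": 80, "no3": 2.28}
predicted = predict_dace(basin_run)
print(round(predicted, 1))  # 27.9
```

The observed count for BASIN_RUN was 13, so the prediction of about 27.9 is well off, which is consistent with the low R2 of the model.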
This page was last revised March 27, 2013. Its address is http://udel.edu/~mcdonald/statmultreg.html. It may be cited as pp. 239-246 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.
©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.