# Polynomial regression

### When to use it

Sometimes, when you analyze data with correlation and linear regression, you notice that the relationship between the independent (X) variable and dependent (Y) variable looks like it follows a curved line, not a straight line. In that case, the linear regression line will not be very good for describing and predicting the relationship, and the P-value may not be an accurate test of the hypothesis that the variables are not associated.

Your first choice when faced with a curved relationship between two measurement variables should be to try data transformations on one or both of the variables. Often, this will straighten out a simple J-shaped curve. If that doesn't work, you can try curvilinear regression, in which a more complicated equation than the linear regression equation is fit to the data. Equations with a variety of terms will produce curved lines, including exponential (involving b^{X}, where *b* is a constant), power (involving X^{b}), logarithmic (involving log(X)), and trigonometric (involving sine, cosine, or other trigonometric functions). For any particular form of equation involving such terms, it is possible to find the equation for the curved line that best fits the data points, and to compare the fit of the more complicated equation to that of a simpler equation (such as the equation for a straight line).

Here I will use polynomial regression as one example of curvilinear regression. A polynomial equation has X raised to integer powers such as X^{2} and X^{3}. A quadratic equation has the form Y=a+b_{1}X+b_{2}X^{2}, where *a* is the Y-intercept and b_{1} and b_{2} are constants. It produces a parabola. A cubic equation has the form Y=a+b_{1}X+b_{2}X^{2}+b_{3}X^{3} and produces an S-shaped curve, while a quartic equation has the form Y=a+b_{1}X+b_{2}X^{2}+b_{3}X^{3}+b_{4}X^{4} and can produce M or W shaped curves. You can fit higher-order polynomial equations, but it is very unlikely that you would want to use anything more than the cubic in biology.
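If you want to get a feel for these forms, they are easy to evaluate in a short Python sketch (the coefficients below are arbitrary, chosen only for illustration; this is not part of any statistics package):

```python
def poly_value(a, bs, x):
    """Evaluate Y = a + b1*X + b2*X**2 + ... using Horner's rule.

    a is the Y-intercept; bs = [b1, b2, ...] are the constants
    multiplying the successive powers of X.
    """
    acc = 0.0
    for b in reversed(bs):
        acc = acc * x + b
    return a + acc * x

# A quadratic Y = 1 + 2X + 3X^2 (made-up coefficients) at X = 2:
print(poly_value(1, [2, 3], 2))      # 1 + 4 + 12 = 17.0

# A cubic Y = X^3 at X = 3:
print(poly_value(0, [0, 0, 1], 3))   # 27.0
```

Plotting `poly_value` over a range of X for quadratic, cubic, and quartic coefficient lists will reproduce the parabola, S, and M/W shapes described above.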

### Null hypotheses

Several null hypotheses are tested while doing polynomial regression. The first null hypothesis is that a quadratic equation does not fit the data significantly better than a linear equation; the next null hypothesis may be that a cubic equation does not fit the data significantly better than a quadratic equation, and so on. There is also a null hypothesis for each equation that says that it does not fit the data significantly better than a horizontal line; in other words, that there is no relationship between the X and Y variables.

### How the test works

In polynomial regression, different powers of the X variable (X, X^{2}, X^{3}…) are added to an equation to see whether they increase the r^{2} significantly. First a linear regression is done, fitting an equation of the form Y=a+bX to the data. Then an equation of the form Y=a+b_{1}X+b_{2}X^{2}, which produces a parabola, is fit to the data. The r^{2} will always increase when you add a higher-order term, but the question is whether the increase in r^{2} is significantly greater than expected due to chance. Next, an equation of the form Y=a+b_{1}X+b_{2}X^{2}+b_{3}X^{3}, which produces an S-shaped line, is fit and the increase in r^{2} is tested. This can continue until adding another term does not increase r^{2} significantly, although in most cases it is hard to imagine a biological meaning for exponents greater than 3. Once the best-fitting equation is chosen, it is tested to see whether it fits the data significantly better than an equation of the form Y=a; in other words, a horizontal line.

Even though the usual procedure is to test the linear regression first, then the quadratic, then the cubic, you don't need to stop if one of these is not significant. For example, if the graph looks U-shaped, the linear regression may not be significant, but the quadratic will be.
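Here is a rough Python sketch of the stepwise idea, using made-up U-shaped data like the situation just described. It is an illustration, not the spreadsheet described later: it fits each polynomial by ordinary least squares (via the normal equations) and compares the r^{2} values; in practice you would then test each increase in r^{2} with the F-statistic described under "How to do the test."

```python
def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [a, b1, ..., b_degree] for
    Y = a + b1*X + ... + b_degree*X**degree.
    """
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    v = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                      # Gaussian elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0] * n
    for i in range(n - 1, -1, -1):            # back-substitution
        b[i] = (v[i] - sum(A[i][j] * b[j] for j in range(i + 1, n))) / A[i][i]
    return b

def r_squared(xs, ys, coeffs):
    """Proportion of variance in Y explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
                 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot

# Made-up U-shaped data: Y rises as X moves away from 5 in either direction.
xs = list(range(11))
ys = [(x - 5) ** 2 for x in xs]
r2_linear = r_squared(xs, ys, fit_poly(xs, ys, 1))
r2_quadratic = r_squared(xs, ys, fit_poly(xs, ys, 2))
# The straight line explains essentially nothing; the parabola fits well.
```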

### Examples

Fernandez-Juricic et al. (2003) examined the effect of human disturbance on the nesting of house sparrows (*Passer domesticus*). They counted breeding sparrows per hectare in 18 parks in Madrid, Spain, and also counted the number of people per minute walking through each park (both measurement variables); the identity of the park is the hidden nominal variable.

*Graph of sparrow abundance vs. human disturbance with linear regression line.*

The linear regression is not significant (r^{2}=0.174, 16 d.f., P=0.08).

*Graph of sparrow abundance vs. human disturbance with quadratic regression line.*

The quadratic regression is significant (r^{2}=0.372, 15 d.f., P=0.03), and it is significantly better than the linear regression (P=0.03). This seems biologically plausible; the data suggest that there is some intermediate level of human traffic that is best for house sparrows. Perhaps areas with too many humans scare the sparrows away, while areas with too few humans favor other birds that outcompete the sparrows for nest sites.

*Graph of sparrow abundance vs. human disturbance with cubic regression line.*

The cubic regression is significant (r^{2}=0.765, 14 d.f., P=0.0001), and the increase in r^{2} between the cubic and the quadratic equation is highly significant (P=1×10^{−5}). The cubic equation is

Y=0.0443X^{3}−2.916X^{2}+50.601X−87.765.

The quartic equation does not fit significantly better than the cubic equation (P=0.80). Even though the cubic equation fits significantly better than the quadratic, it's more difficult to imagine a plausible biological explanation for this. I'd want to see more samples from areas with more than 35 people per hectare per minute before I accepted that the sparrow abundance really starts to increase again above that level of pedestrian traffic.

Ashton et al. (2007) measured the carapace length (in mm) of 18 female gopher tortoises (*Gopherus polyphemus*) in Okeeheelee County Park, Florida, and X-rayed them to count the number of eggs in each. The data are shown below in the SAS example. The linear regression is not significant (r^{2}=0.015, 16 d.f., P=0.63), but the quadratic is significant (r^{2}=0.43, 15 d.f., P=0.014). The increase in r^{2} from linear to quadratic is significant (P=0.001). The best-fit quadratic equation is Y=−899.9+5.857X−0.009425X^{2}. Adding the cubic and quartic terms does not significantly increase the r^{2}.

The first part of the graph is not surprising; it's easy to imagine why bigger tortoises would have more eggs. The decline in egg number above 310 mm carapace length is the interesting result; it suggests that egg production declines in these tortoises as they get old and big.
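You can locate that peak directly from the quadratic coefficients: a parabola Y=a+b_{1}X+b_{2}X^{2} has its extremum at X=−b_{1}/(2b_{2}). A quick Python check, using the coefficients of the best-fit quadratic reported above:

```python
# Coefficients of the best-fit quadratic for the tortoise data:
a, b1, b2 = -899.9, 5.857, -0.009425

# Vertex of the parabola: the carapace length with the largest
# predicted clutch size (b2 < 0, so the extremum is a maximum).
x_peak = -b1 / (2 * b2)
y_peak = a + b1 * x_peak + b2 * x_peak ** 2
print(round(x_peak, 1), round(y_peak, 1))   # 310.7 10.0
```

The predicted maximum of about 10 eggs at a carapace length of about 311 mm matches the decline visible in the graph above 310 mm.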

*Graph of clutch size (number of eggs) vs. carapace length, with best-fit quadratic line.*

### Graphing the results

As shown above, you graph a polynomial regression the same way you would a linear regression, a scattergraph with the independent variable on the X-axis and the dependent variable on the Y-axis. In general, you shouldn't show the regression line for values outside the range of observed X values, as extrapolation with polynomial regression is even more likely than linear regression to yield ridiculous results. For example, extrapolating the quadratic equation relating tortoise carapace length and number of eggs predicts that tortoises with carapace length less than 279 mm or greater than 343 mm would have negative numbers of eggs.
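Those limits come from solving the quadratic for Y=0. A quick Python check with the quadratic formula, using the coefficients of the tortoise equation above:

```python
import math

# Best-fit quadratic for the tortoise data: Y = a + b1*X + b2*X^2
a, b1, b2 = -899.9, 5.857, -0.009425

# Solve a + b1*X + b2*X^2 = 0 for the carapace lengths where the
# predicted clutch size crosses zero.
disc = math.sqrt(b1 ** 2 - 4 * b2 * a)
lo = (-b1 + disc) / (2 * b2)
hi = (-b1 - disc) / (2 * b2)
print(round(lo), round(hi))   # roughly 278 and 343

# Outside that range the curve predicts negative clutch sizes:
print(a + b1 * 270 + b2 * 270 ** 2 < 0)   # True
```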

### Similar tests

Before performing a polynomial regression, you should try different transformations when faced with an obviously curved relationship between an X and a Y variable. A linear equation relating transformed variables is simpler and more elegant than a curvilinear equation relating untransformed variables. You should also remind yourself of your reason for doing a regression. If your purpose is prediction of unknown values of Y corresponding to known values of X, then you need an equation that fits the data points well, and a polynomial regression may be appropriate if transformations do not work. However, if your purpose is testing the null hypothesis that there is no relationship between X and Y, and a linear regression gives a significant result, you may want to stick with the linear regression even if polynomial gives a significantly better fit. Using a less-familiar technique that yields a more-complicated equation may cause your readers to be a bit suspicious of your results; they may feel you went fishing around for a statistical test that supported your hypothesis, especially if there's no obvious biological reason for an equation with terms containing exponents.

Spearman rank correlation is a nonparametric test of the association between two variables. It will work well if there is a steady increase or decrease in Y as X increases, but not if Y goes up and then goes down.

Polynomial regression is a form of multiple regression. In multiple regression, there is one dependent (Y) variable and multiple independent (X) variables, and the X variables (X_{1}, X_{2}, X_{3}...) are added to the equation to see whether they increase the R^{2} significantly. In polynomial regression, the independent "variables" are just X, X^{2}, X^{3}, etc.
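You can see this by writing out the design matrix: each observation contributes a row of powers of its X value, which a multiple-regression routine treats as separate predictors. A minimal Python sketch (the X values are the first three carapace lengths from the SAS example below, used only for illustration):

```python
# Carapace lengths (mm); any X values would do.
xs = [284.0, 290.0, 298.0]

# Each row is [1, X, X^2, X^3]: the "independent variables" of the
# multiple regression equivalent to a cubic polynomial regression.
design = [[1.0, x, x ** 2, x ** 3] for x in xs]
print(design[0])   # [1.0, 284.0, 80656.0, 22906304.0]
```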

### How to do the test

#### Spreadsheet

I have prepared a spreadsheet that will help you perform a polynomial regression. It tests equations up to fourth order, and it will handle up to 1000 observations. **An earlier version of the spreadsheet gave an incorrect P-value for the quartic equation; it was corrected on Dec. 12, 2012. Thanks to Ewen McLaughlin for catching this.**

#### Web pages

There is a very powerful web page that will fit just about any equation you can think of to your data (not just polynomials). Another web page will fit any of 15 commonly used equations; it is easier to use, and even draws a graph. A third web page only does polynomial regression, but is very fast and easy to use.

#### SAS

To do polynomial regression in SAS, you create a data set containing the square of the independent variable, the cube, etc. You then use PROC REG for models containing the higher-order variables. It's possible to do this as a multiple regression, but I think it's less confusing to use multiple MODEL statements, adding one term to each model. There doesn't seem to be an easy way to test the significance of the increase in r^{2} in SAS, so you'll have to do that by hand. If r^{2}_{i} is the r^{2} for the ith-order equation, r^{2}_{j} is the r^{2} for the next higher order, and d.f._{j}=n−j−1 is the degrees of freedom for the higher-order equation, the F-statistic is d.f._{j}×(r^{2}_{j}−r^{2}_{i})/(1−r^{2}_{j}). It has *j* degrees of freedom in the numerator and d.f._{j} degrees of freedom in the denominator.

Here's an example, using the data on tortoise carapace length and clutch size from Ashton et al. (2007).

```sas
data turtles;
  input length clutch;
  cards;
284  3
290  2
290  7
290  7
298 11
299 12
302 10
306  8
306  8
309  9
310 10
311 13
317  7
317  9
320  6
323 13
334  2
334  8
;
data turtlepower;
  set turtles;
  length2=length*length;
  length3=length*length*length;
  length4=length*length*length*length;
proc reg data=turtlepower;
  model clutch=length;
  model clutch=length length2;
  model clutch=length length2 length3;
run;
```

In the output, first look for the r^{2} values under each model:

```
The REG Procedure
Model: MODEL1
Dependent Variable: clutch
. . .
Root MSE         3.41094    R-Square   0.0148  <- linear r-sq
Dependent Mean   8.05556    Adj R-Sq  -0.0468
Coeff Var       42.34268
. . .
The REG Procedure
Model: MODEL2
Dependent Variable: clutch
. . .
Root MSE         2.67050    R-Square   0.4338  <- quadratic r-sq
Dependent Mean   8.05556    Adj R-Sq   0.3583
Coeff Var       33.15104
```

For this example, n=18. The F-statistic for the increase in r^{2} from linear to quadratic is 15×(0.4338−0.0148)/(1−0.4338)=11.10 with d.f.=2, 15. Using a spreadsheet (enter =FDIST(11.10, 2, 15)) or an online F-statistic calculator, this gives a P-value of 0.0011.
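If you prefer to script the hand calculation, here is a minimal Python version of the F formula given above (you still get the P-value from FDIST, an online calculator, or an F table):

```python
def f_increase(r2_lower, r2_higher, n, j):
    """F-statistic for the increase in r^2 from a lower-order polynomial
    to the jth-order polynomial, following the formula given above.

    The denominator degrees of freedom are df_j = n - j - 1.
    """
    df_j = n - j - 1
    return df_j * (r2_higher - r2_lower) / (1 - r2_higher)

# Tortoise example: linear (r^2 = 0.0148) vs. quadratic (r^2 = 0.4338),
# n = 18 observations, j = 2 for the quadratic.
f = f_increase(0.0148, 0.4338, 18, 2)
print(round(f, 2))   # 11.1
```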

Once you've figured out which equation is best (the quadratic, for our example, since the cubic and quartic equations do not significantly increase the r^{2}), look for the parameters in the output:

```
                Parameter Estimates

                   Parameter     Standard
Variable    DF      Estimate        Error   t Value   Pr > |t|
Intercept    1    -899.93459    270.29576     -3.33     0.0046
length       1       5.85716      1.75010      3.35     0.0044
length2      1      -0.00942      0.00283     -3.33     0.0045
```

This tells you that the equation for the best-fit quadratic curve is Y=−899.9+5.857X−0.00942X^{2}.

### Further reading

Sokal and Rohlf, pp. 665-670.

Zar, pp. 452-459.

### References

Ashton, K.G., R.L. Burke, and J.N. Layne. 2007. Geographic variation in body and clutch size of gopher tortoises. Copeia 2007: 355-363.

Fernandez-Juricic, E., A. Sallent, R. Sanz, and I. Rodriguez-Prieto. 2003. Testing the risk-disturbance hypothesis in a fragmented landscape: non-linear responses of house sparrows to humans. Condor 105: 316-326.


This page was last revised December 12, 2012. Its address is http://udel.edu/~mcdonald/statcurvreg.html. It may be cited as pp. 224-231 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.

©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.