Polynomial regression


When to use it

Sometimes, when you analyze data with correlation and linear regression, you notice that the relationship between the independent (X) variable and dependent (Y) variable looks like it follows a curved line, not a straight line. In that case, the linear regression line will not describe or predict the relationship very well, and the P-value may not accurately test the null hypothesis that the variables are not associated.

Your first choice when faced with a curved relationship between two measurement variables should be to try data transformations on one or both of the variables. Often, this will straighten out a simple J-shaped curve. If that doesn't work, you can try curvilinear regression, in which a more complicated equation than the linear regression equation is fit to the data. Equations with a variety of terms will produce curved lines, including exponential (involving bˣ, where b is a constant), power (involving Xᵇ), logarithmic (involving log(X)), and trigonometric (involving sine, cosine, or other trigonometric functions). For any particular form of equation involving such terms, it is possible to find the equation for the curved line that best fits the data points, and to compare the fit of the more complicated equation to that of a simpler equation (such as the equation for a straight line).
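
For example, a log transformation of the Y variable will straighten out data that follow an exponential curve. Here is a minimal SAS sketch of this approach; the data set name, variable names, and data are hypothetical, invented just for illustration:

data curved;
   input x y;
   logy=log(y);      * natural log transformation of Y;
   cards;
1     2.1
2     4.3
3     8.8
4    17.5
5    36.2
;
proc reg data=curved;
   model logy=x;     * a straight line on the log scale describes an exponential curve;
   run;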

Here I will use polynomial regression as one example of curvilinear regression. A polynomial equation has X raised to integer powers such as X² and X³. A quadratic equation has the form Y=a+b₁X+b₂X², where a is the Y-intercept and b₁ and b₂ are constants. It produces a parabola. A cubic equation has the form Y=a+b₁X+b₂X²+b₃X³ and produces an S-shaped curve, while a quartic equation has the form Y=a+b₁X+b₂X²+b₃X³+b₄X⁴ and can produce M- or W-shaped curves. You can fit higher-order polynomial equations, but it is very unlikely that you would want to use anything more than the cubic in biology.
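
If you want to see these shapes for yourself, a short SAS program like the following will generate and plot one curve of each order. The coefficients here are arbitrary values I picked just to produce the characteristic shapes:

data shapes;
   do x=-3 to 3 by 0.1;
      quadratic=1+x+x*x;       * a parabola;
      cubic=x+x**3;            * an S-shaped curve;
      quartic=x**4-3*x**2;     * a W-shaped curve;
      output;
   end;
run;
proc sgplot data=shapes;
   series x=x y=quadratic;
   series x=x y=cubic;
   series x=x y=quartic;
run;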

Null hypotheses

Several null hypotheses are tested while doing polynomial regression. The first null hypothesis is that a quadratic equation does not fit the data significantly better than a linear equation; the next null hypothesis may be that a cubic equation does not fit the data significantly better than a quadratic equation, and so on. There is also a null hypothesis for each equation that says that it does not fit the data significantly better than a horizontal line; in other words, that there is no relationship between the X and Y variables.

How the test works

In polynomial regression, different powers of the X variable (X, X², X³…) are added to an equation to see whether they increase the r² significantly. First a linear regression is done, fitting an equation of the form Y=a+bX to the data. Then an equation of the form Y=a+b₁X+b₂X², which produces a parabola, is fit to the data. The r² will always increase when you add a higher-order term, but the question is whether the increase in r² is significantly greater than expected due to chance. Next, an equation of the form Y=a+b₁X+b₂X²+b₃X³, which produces an S-shaped curve, is fit and the increase in r² is tested. This can continue until adding another term does not increase r² significantly, although in most cases it is hard to imagine a biological meaning for exponents greater than 3. Once the best-fitting equation is chosen, it is tested to see whether it fits the data significantly better than an equation of the form Y=a; in other words, a horizontal line.

Even though the usual procedure is to test the linear regression first, then the quadratic, then the cubic, you don't need to stop if one of these is not significant. For example, if the graph looks U-shaped, the linear regression may not be significant, but the quadratic will be.

Examples

Fernandez-Juricic et al. (2003) examined the effect of human disturbance on the nesting of house sparrows (Passer domesticus). They counted breeding sparrows per hectare in 18 parks in Madrid, Spain, and also counted the number of people per minute walking through each park (both measurement variables); the identity of the park is the hidden nominal variable.


[Graph of sparrow abundance vs. human disturbance, with linear regression line.]

The linear regression is not significant (r²=0.174, 16 d.f., P=0.08).



[Graph of sparrow abundance vs. human disturbance, with quadratic regression line.]

The quadratic regression is significant (r²=0.372, 15 d.f., P=0.03), and it fits significantly better than the linear regression (P=0.046). This seems biologically plausible; the data suggest that there is some intermediate level of human traffic that is best for house sparrows. Perhaps areas with too many humans scare the sparrows away, while areas with too few humans favor other birds that outcompete the sparrows for nest sites.


[Graph of sparrow abundance vs. human disturbance, with cubic regression line.]

The cubic regression is significant (r²=0.765, 14 d.f., P=0.0001), and the increase in r² between the cubic and the quadratic equation is highly significant (P=0.0003). The cubic equation is
Y=0.0443X³−2.916X²+50.601X−87.765.
The quartic equation does not fit significantly better than the cubic equation (P=0.80). Even though the cubic equation fits significantly better than the quadratic, it's more difficult to imagine a plausible biological explanation for this. I'd want to see more samples from areas with more than 35 people per hectare per minute before I accepted that the sparrow abundance really starts to increase again above that level of pedestrian traffic.


Ashton et al. (2007) measured the carapace length (in mm) of 18 female gopher tortoises (Gopherus polyphemus) in Okeeheelee County Park, Florida, and X-rayed them to count the number of eggs in each. The data are shown below in the SAS example. The linear regression is not significant (r²=0.015, 16 d.f., P=0.63), but the quadratic is significant (r²=0.43, 15 d.f., P=0.014). The increase in r² from linear to quadratic is significant (P=0.005). The best-fit quadratic equation is Y=−899.9+5.857X−0.009425X². Adding the cubic and quartic terms does not significantly increase the r².

The first part of the graph is not surprising; it's easy to imagine why bigger tortoises would have more eggs. The decline in egg number above 310 mm carapace length is the interesting result; it suggests that egg production declines in these tortoises as they get old and big.

[X-ray of a tortoise, showing eggs.]
[Graph of clutch size (number of eggs) vs. carapace length, with best-fit quadratic line.]

Graphing the results

As shown above, you graph a polynomial regression the same way you would a linear regression: a scattergraph with the independent variable on the X-axis and the dependent variable on the Y-axis. In general, you shouldn't show the regression line for values outside the range of observed X values, as extrapolation with polynomial regression is even more likely than with linear regression to yield ridiculous results. For example, extrapolating the quadratic equation relating tortoise carapace length and number of eggs predicts that tortoises with carapace length less than 279 mm or greater than 343 mm would have negative numbers of eggs.
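
You can find those two carapace lengths yourself by setting the best-fit quadratic equation equal to zero and solving with the quadratic formula. Here is a small SAS data step that does the arithmetic (the data set and variable names are my own; with the rounded coefficients the roots come out near 278 and 343 mm):

data roots;
   a=-0.009425; b=5.857; c=-899.9;   * coefficients of the best-fit quadratic;
   disc=sqrt(b*b-4*a*c);             * square root of the discriminant;
   root1=(-b+disc)/(2*a);
   root2=(-b-disc)/(2*a);
   put root1= root2=;                * carapace lengths where predicted clutch size is zero;
run;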

Similar tests

Before performing a polynomial regression, you should try different transformations when faced with an obviously curved relationship between an X and a Y variable. A linear equation relating transformed variables is simpler and more elegant than a curvilinear equation relating untransformed variables. You should also remind yourself of your reason for doing a regression. If your purpose is prediction of unknown values of Y corresponding to known values of X, then you need an equation that fits the data points well, and a polynomial regression may be appropriate if transformations do not work. However, if your purpose is testing the null hypothesis that there is no relationship between X and Y, and a linear regression gives a significant result, you may want to stick with the linear regression even if a polynomial gives a significantly better fit. Using a less-familiar technique that yields a more-complicated equation may cause your readers to be a bit suspicious of your results; they may feel you went fishing around for a statistical test that supported your hypothesis, especially if there's no obvious biological reason for an equation with terms containing exponents.

Spearman rank correlation is a nonparametric test of the association between two variables. It will work well if there is a steady increase or decrease in Y as X increases, but not if Y goes up and then goes down.

Polynomial regression is a form of multiple regression. In multiple regression, there is one dependent (Y) variable and multiple independent (X) variables, and the X variables (X₁, X₂, X₃…) are added to the equation to see whether they increase the R² significantly. In polynomial regression, the independent "variables" are just X, X², X³, etc.

How to do the test

Spreadsheet

I have prepared a spreadsheet that will help you perform a polynomial regression. It tests equations up to fourth order, and it will handle up to 1000 observations. An earlier version of the spreadsheet gave an incorrect P-value for the quartic equation; it was corrected on Dec. 12, 2012. Thanks to Ewen McLaughlin for catching this.

Web pages

There is a very powerful web page that will fit just about any equation you can think of to your data (not just polynomial equations). Another web page will fit any of 15 commonly used equations; it is easier to use, and even draws a graph. A third web page only does polynomial regression, but it is very fast and easy to use.

SAS

To do polynomial regression in SAS, you create a data set containing the square of the independent variable, the cube, etc. You then use PROC REG for models containing the higher-order variables. It's possible to do this as a multiple regression, but I think it's less confusing to use multiple model statements, adding one term to each model. There doesn't seem to be an easy way to test the significance of the increase in r² in SAS, so you'll have to do that by hand. If r²ᵢ is the r² for the ith order, and r²ⱼ is the r² for the next higher order, and d.f.ⱼ is the degrees of freedom for the higher-order equation, the F-statistic is d.f.ⱼ×(r²ⱼ−r²ᵢ)/(1−r²ⱼ). Because you are adding a single term at each step, the F-statistic has 1 degree of freedom in the numerator and d.f.ⱼ=n−j−1 degrees of freedom in the denominator.

Here's an example, using the data on tortoise carapace length and clutch size from Ashton et al. (2007).


data turtles;
   input length clutch;
   cards;
284     3
290     2
290     7
290     7
298     11
299     12
302     10
306     8
306     8
309     9
310     10
311     13
317     7
317     9
320     6
323     13
334     2
334     8
;
data turtlepower; set turtles;
   length2=length*length;
   length3=length*length*length;
   length4=length*length*length*length;
proc reg data=turtlepower;
   model clutch=length;
   model clutch=length length2;
   model clutch=length length2 length3;
   model clutch=length length2 length3 length4;
   run;
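
If you'd rather not do the hand calculation described above, I believe PROC REG's TEST statement will do the equivalent partial F-test when you have added a single term; treat this sketch as a suggestion to verify rather than a guaranteed recipe. Testing whether the coefficient of length2 is zero in the quadratic model is the same as testing whether the quadratic fits better than the linear:

proc reg data=turtlepower;
   model clutch=length length2;
   quad: test length2=0;    * F-test that the quadratic term adds nothing;
   run;

For a single added term, the F-statistic from this test is just the square of the t-value for length2 in the parameter estimates table shown below ((−3.33)²≈11.1), so it gives the same P-value as that t-test.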

In the output, first look for the r² values under each model:


                    The REG Procedure
                     Model: MODEL1
                 Dependent Variable: clutch
                         .
                         .
                         .
Root MSE              3.41094    R-Square    0.0148 linear r-sq
Dependent Mean        8.05556    Adj R-Sq   -0.0468
Coeff Var            42.34268
                         .
                         .
                         .
                   The REG Procedure
                      Model: MODEL2
                Dependent Variable: clutch
                         .
                         .
                         .
Root MSE              2.67050    R-Square    0.4338 quadratic r-sq
Dependent Mean        8.05556    Adj R-Sq    0.3583
Coeff Var            33.15104

For this example, n=18. The F-statistic for the increase in r² from linear to quadratic is 15×(0.4338−0.0148)/(1−0.4338)=11.10 with d.f.=1, 15. Using a spreadsheet (enter =FDIST(11.10, 1, 15)) or an online F-statistic calculator, this gives a P-value of 0.0046. As a check, this matches (except for rounding) the P-value that SAS reports for the length2 term in the quadratic model (Pr > |t|=0.0045), since testing the increase in r² from adding a single term is equivalent to testing whether that term's coefficient is zero.
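
If you'd rather have SAS do this arithmetic, a short data step works. This is just a sketch using the r² values from the output above; the data set and variable names are my own invention:

data ftest;
   n=18; j=2;                         * j is the order of the higher (quadratic) model;
   r2_lin=0.0148; r2_quad=0.4338;     * R-square values from the PROC REG output;
   dfj=n-j-1;                         * denominator degrees of freedom (15);
   f=dfj*(r2_quad-r2_lin)/(1-r2_quad);
   p=1-probf(f,1,dfj);                * upper-tail probability of F with 1, 15 d.f.;
   put f= p=;                         * prints F of about 11.10 and P of about 0.0046;
run;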

Once you've figured out which equation is best (the quadratic, for our example, since the cubic and quartic equations do not significantly increase the r²), look for the parameters in the output:


                      Parameter Estimates
 
                    Parameter     Standard
Variable     DF      Estimate        Error   t Value   Pr > |t|

Intercept     1    -899.93459    270.29576     -3.33     0.0046
length        1       5.85716      1.75010      3.35     0.0044
length2       1      -0.00942      0.00283     -3.33     0.0045

This tells you that the equation for the best-fit quadratic curve is Y=−899.9+5.857X−0.00942X².

Further reading

Sokal and Rohlf, pp. 665-670.

Zar, pp. 452-459.

References

X-ray of a tortoise from The Tortoise Shop.

Ashton, K.G., R.L. Burke, and J.N. Layne. 2007. Geographic variation in body and clutch size of gopher tortoises. Copeia 2007: 355-363.

Fernandez-Juricic, E., A. Sallent, R. Sanz, and I. Rodriguez-Prieto. 2003. Testing the risk-disturbance hypothesis in a fragmented landscape: non-linear responses of house sparrows to humans. Condor 105: 316-326.



