# Logistic regression

### When to use it

You use simple logistic regression when you have one nominal variable with two values (male/female, dead/alive, etc.) and one measurement variable. The nominal variable is the dependent variable, and the measurement variable is the independent variable.

Multiple logistic regression is used when the dependent variable is nominal and there is more than one independent variable. It is analogous to multiple linear regression, and all of the same caveats apply. If you're an epidemiologist, you'll probably need to take a whole course on multiple logistic regression; if you're any other kind of biologist, you'll probably never use it. I won't discuss it any more here; if I say "logistic regression," I'm referring to simple logistic regression.

Simple logistic regression is analogous to linear regression, except that the dependent variable is nominal, not a measurement. One goal is to see whether the probability of getting a particular value of the nominal variable is associated with the measurement variable; the other goal is to predict the probability of getting a particular value of the nominal variable, given the measurement variable.

Data with one nominal and one measurement variable can also be analyzed using a one-way anova or a Student's t-test, and the distinction can be subtle. One clue is that logistic regression allows you to predict the probability of the nominal variable. For example, imagine that you had measured the cholesterol level in the blood of a large number of 55-year-old women, then followed up ten years later to see who had had a heart attack. You could do a t-test, comparing the cholesterol levels of the women who did have heart attacks vs. those who didn't, and that would be a perfectly reasonable way to test the null hypothesis that cholesterol level is not associated with heart attacks; if the hypothesis test was all you were interested in, the t-test would probably be better than the less-familiar logistic regression. However, if you wanted to predict the probability that a 55-year-old woman with a particular cholesterol level would have a heart attack in the next ten years, so that doctors could tell their patients "If you reduce your cholesterol by 40 points, you'll reduce your risk of heart attack by X percent," you would have to use logistic regression.

 A Komodo dragon, Varanus komodoensis.

Another situation that calls for logistic regression, rather than an anova or t-test, is when the values of the measurement variable are set by the experimenter, while the values of the nominal variable are free to vary. For example, let's say you are studying the effect of incubation temperature on sex determination in Komodo dragons. You raise 10 eggs at 30 C, 30 eggs at 32 C, 12 eggs at 34 C, etc., then determine the sex of the hatchlings. It would be silly to compare the mean incubation temperatures between male and female hatchlings, and test the difference using an anova or t-test, because the incubation temperature does not depend on the sex of the offspring; you've set the incubation temperature, and if there is a relationship, it's that the sex of the offspring depends on the temperature.

When there are multiple observations of the nominal variable for each value of the measurement variable, as in the Komodo dragon example, you'll often see the data analyzed using linear regression, with the proportions treated as a second measurement variable. Often the proportions are arc-sine transformed, because that makes the distributions of proportions more normal. This is not horrible, but it's not strictly correct. One problem is that linear regression treats all of the proportions equally, even if they are based on much different sample sizes. If 6 out of 10 Komodo dragon eggs raised at 30 C were female, and 15 out of 30 eggs raised at 32 C were female, the 60% female at 30 C and 50% at 32 C would get equal weight in a linear regression, which is inappropriate. Logistic regression analyzes each observation (in this example, the sex of each Komodo dragon) separately, so the 30 dragons at 32 C would have 3 times the weight of the 10 dragons at 30 C.

It is also possible to do logistic regression with two nominal variables, but to be honest, I don't see the advantage of this over a chi-squared or G-test of independence.

### Null hypothesis

The statistical null hypothesis is that the probability of a particular value of the nominal variable is not associated with the value of the measurement variable; in other words, the line describing the relationship between the measurement variable and the probability of the nominal variable has a slope of zero.

### How the test works

Simple logistic regression finds the equation that best predicts the value of the Y variable for each value of the X variable. What makes logistic regression different from linear regression is that the Y variable is not directly measured; it is instead the probability of obtaining a particular value of a nominal variable. If you were studying people who had heart attacks, the values of the nominal variable would be "did have a heart attack" vs. "didn't have a heart attack." The Y variable used in logistic regression would then be the probability of having a heart attack. This probability could take values from 0 to 1. The limited range of this probability would present problems if used directly in a regression, so the odds, Y/(1-Y), is used instead. (If the probability of a heart attack is 0.25, the odds of a heart attack are 0.25/(1-0.25)=1/3. In gambling terms, this would be expressed as "3 to 1 odds against having a heart attack.") Taking the natural log of the odds makes the variable more suitable for a regression, so the result of a logistic regression is an equation that looks like this:

ln[Y/(1−Y)]=a+bX
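The arithmetic behind this transformation is easy to check by hand. Here is a minimal Python sketch of it; the function names `odds`, `logit`, and `inverse_logit` are my own labels for illustration and do not come from any particular statistics package.

```python
import math

def odds(p):
    """Convert a probability Y into odds, Y/(1-Y)."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds; this is the left side of ln[Y/(1-Y)] = a + bX."""
    return math.log(odds(p))

def inverse_logit(z):
    """Turn a value of a + bX back into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.25                          # probability of a heart attack, from the text
print(odds(p))                    # 1/3, i.e. "3 to 1 against"
print(logit(p))                   # ln(1/3), roughly -1.10
print(inverse_logit(logit(p)))    # recovers approximately 0.25
```

Note that the log of the odds can take any value from minus infinity to plus infinity, which is why it is better suited to a regression than the probability itself.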

The slope (b) and intercept (a) of the best-fitting equation in a logistic regression are found using the maximum-likelihood method, rather than the least-squares method used for linear regression. Maximum likelihood is a computer-intensive technique; the basic idea is that it finds the values of the parameters under which you would be most likely to get the observed results.
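To make the idea of maximum likelihood concrete, here is a rough Python sketch. The eight observations are made up purely for illustration (they are not from any study), and the crude grid search is an assumption chosen for readability; real software uses much more efficient iterative methods.

```python
import math

def log_likelihood(a, b, xs, ys):
    """Log of the probability of the observed 0/1 outcomes, given a and b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))   # predicted probability of a 1
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Hypothetical data: X is a measurement, Y is a nominal outcome coded 0 or 1
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

# Try every intercept from -5 to 5 and every slope from -2 to 2, in steps of 0.1,
# and keep the pair under which the observed outcomes are most likely
candidates = [(a / 10, b / 10) for a in range(-50, 51) for b in range(-20, 21)]
best_a, best_b = max(candidates, key=lambda ab: log_likelihood(ab[0], ab[1], xs, ys))

best_ll = log_likelihood(best_a, best_b, xs, ys)
null_ll = log_likelihood(0.0, 0.0, xs, ys)   # no relationship, probability 0.5
print(best_a, best_b, best_ll, null_ll)
```

The pair of values with the highest log-likelihood is the (approximate) maximum-likelihood estimate; a finer grid would get closer to what a statistics program reports.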

There are several different ways of estimating the P-value. The Wald chi-square is fairly popular, but it may yield inaccurate results with small sample sizes. The likelihood ratio method may be better. It compares the likelihood of the observed results under the logistic model with the likelihood of the observed results under a model with no relationship between the independent and dependent variables. I recommend you use the likelihood-ratio method; be sure to specify which method you've used when you report your results.

### Examples

 An amphipod crustacean, Megalorchestia californiana.

McDonald (1985) counted allele frequencies at the mannose-6-phosphate isomerase (Mpi) locus in the amphipod crustacean Megalorchestia californiana, which lives on sandy beaches of the Pacific coast of North America. There were two common alleles, Mpi90 and Mpi100. The latitude of each collection location, the count of each of the alleles, and the proportion of the Mpi100 allele, are shown here:

```
location          latitude  Mpi90  Mpi100  p, Mpi100
Port Townsend, WA    48.1      47    139     0.748
Neskowin, OR         45.2     177    241     0.577
Siuslaw R., OR       44.0    1087   1183     0.521
Umpqua R., OR        43.7     187    175     0.483
Coos Bay, OR         43.5     397    671     0.628
San Francisco, CA    37.8      40     14     0.259
Carmel, CA           36.6      39     17     0.304
Santa Barbara, CA    34.3      30      0     0.000

```

Allele (Mpi90 or Mpi100) is the nominal variable, location is the hidden nominal variable, and latitude is the measurement variable. If the biological question were "Do different locations have different allele frequencies?", you would ignore latitude and do a chi-square or G-test of independence; here the biological question is "Are allele frequencies associated with latitude?"

Note that although the proportion of the Mpi100 allele seems to increase with increasing latitude, the sample sizes for the northern and southern areas are pretty small. A logistic regression gives χ²=83.3, 1 d.f., P=7×10⁻²⁰. The equation is

ln(Y/(1−Y))=−7.6469+0.1786(latitude),

where Y is the predicted probability of getting an Mpi100 allele. Solving this for Y gives

$$Y=\frac{e^{-7.6469+0.1786(\text{lat})}}{1+e^{-7.6469+0.1786(\text{lat})}}.$$

This logistic regression line is shown on the graph; note that it has a gentle S-shape.

 Mpi allele frequencies vs. latitude in the amphipod Megalorchestia californiana. Error bars are 95% confidence intervals; the thick black line is the logistic regression line.
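If you want to see where these numbers come from, the following Python sketch fits the amphipod data by maximum likelihood and then does the likelihood-ratio test. It is only an illustration under my own assumptions: it uses Newton-Raphson iteration starting from zero with a fixed number of steps, which is one common way to maximize the likelihood but not necessarily what any given program does, and all function and variable names are invented for the example.

```python
import math

# (latitude, Mpi90 count, Mpi100 count), copied from the table above
DATA = [
    (48.1, 47, 139), (45.2, 177, 241), (44.0, 1087, 1183), (43.7, 187, 175),
    (43.5, 397, 671), (37.8, 40, 14), (36.6, 39, 17), (34.3, 30, 0),
]

def log_likelihood(a, b):
    """Log-likelihood of the allele counts, with Mpi100 as the modeled outcome."""
    total = 0.0
    for lat, n90, n100 in DATA:
        z = a + b * lat
        # ln(Y) = -ln(1 + e^-z) and ln(1 - Y) = -ln(1 + e^z)
        total -= n100 * math.log1p(math.exp(-z)) + n90 * math.log1p(math.exp(z))
    return total

def fit(iterations=25):
    """Newton-Raphson search for the maximum-likelihood intercept and slope."""
    a = b = 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for lat, n90, n100 in DATA:
            n = n90 + n100
            y = 1.0 / (1.0 + math.exp(-(a + b * lat)))
            resid = n100 - n * y          # observed minus expected Mpi100 count
            weight = n * y * (1.0 - y)    # binomial variance of the count
            g_a += resid
            g_b += resid * lat
            h_aa += weight
            h_ab += weight * lat
            h_bb += weight * lat * lat
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

a, b = fit()

# Null model: one overall proportion of Mpi100, i.e. the slope fixed at zero
n90_all = sum(row[1] for row in DATA)
n100_all = sum(row[2] for row in DATA)
p0 = n100_all / (n90_all + n100_all)
ll_null = n100_all * math.log(p0) + n90_all * math.log(1.0 - p0)

g_stat = 2.0 * (log_likelihood(a, b) - ll_null)    # likelihood-ratio chi-square
p_value = math.erfc(math.sqrt(g_stat / 2.0))       # chi-square tail area, 1 d.f.
print(round(a, 4), round(b, 4), round(g_stat, 2), p_value)
```

The intercept, slope, chi-square, and P-value this prints should come out close to the values reported above. The `erfc` shortcut for the P-value only works for a chi-square with 1 degree of freedom.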

Imagine that you have measured antennal stroking speed for some male cucumber beetles. You then present each beetle to a female and see whether mating occurs. Mating would be the nominal variable (mated vs. not mated), and you would do a logistic regression with the probability of mating as the Y variable and antennal stroking speed as the X variable. The result would tell you whether the stroking speed was significantly associated with the probability of mating.

### Graphing the results

If you have multiple observations for each value of the measurement variable, as in the amphipod example above, you can plot a scattergraph with the measurement variable on the X-axis and the proportions on the Y-axis. You might want to put 95% confidence intervals on the points; this gives a visual indication of which points contribute more to the regression (the ones with larger sample sizes have smaller confidence intervals).
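Here is one way you might compute such confidence intervals, sketched in Python. I am assuming the Wilson score interval here; that is just one of several methods for a confidence interval on a proportion, and it is not necessarily the one used for the error bars in the amphipod graph.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

# Mpi100 counts and totals for three of the amphipod samples in the table above
samples = {
    "Siuslaw R., OR": (1183, 1087 + 1183),
    "San Francisco, CA": (14, 40 + 14),
    "Santa Barbara, CA": (0, 30 + 0),
}
for name, (mpi100, total) in samples.items():
    low, high = wilson_interval(mpi100, total)
    print(name, round(mpi100 / total, 3), round(low, 3), round(high, 3))
```

The Siuslaw River sample, with over 2000 alleles, gets a much narrower interval than the San Francisco sample of 54, which is the visual cue about sample size described above.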

There's no automatic way in spreadsheets to add the logistic regression line. Here's how I got it onto the graph of the amphipod data. First, I put the latitudes in column A and the proportions in column B. Then, using the Fill: Series command, I added numbers 30, 30.1, 30.2,…50 to cells A10 through A210. In column C I entered the equation for the logistic regression line; in Excel format, it's

=exp(-7.6469+0.1786*(A10))/(1+exp(-7.6469+0.1786*(A10)))

for row 10. I copied this into cells C11 through C210. Then when I drew a graph of the numbers in columns A, B, and C, I gave the numbers in column B symbols but no line, and the numbers in column C got a line but no symbols.
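If you'd rather not use a spreadsheet, the same column of predicted values can be generated with a few lines of Python. This is just a sketch of the steps described above, using the coefficients from the amphipod example; the variable names are my own, and you would still need a plotting program of your choice to draw the line.

```python
import math

INTERCEPT = -7.6469   # from the amphipod logistic regression
SLOPE = 0.1786

def predicted_proportion(latitude):
    """Same calculation as the spreadsheet formula for column C."""
    z = INTERCEPT + SLOPE * latitude
    return math.exp(z) / (1.0 + math.exp(z))

# Latitudes 30, 30.1, 30.2, ... 50, like cells A10 through A210
latitudes = [30 + i / 10 for i in range(201)]
curve = [(lat, predicted_proportion(lat)) for lat in latitudes]

print(curve[0])    # predicted proportion at latitude 30
print(curve[-1])   # predicted proportion at latitude 50
```

Plotting the `curve` pairs as a line with no symbols, on top of the observed proportions as symbols with no line, gives the same kind of graph.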

 Central stoneroller, Campostoma anomalum.

If you only have one observation of the nominal variable for each value of the measurement variable, it would be silly to draw a scattergraph, as each point on the graph would be at either 0 or 1 on the Y-axis. If you have lots of data points, you can divide the measurement values into intervals and plot the proportion for each interval on a bar graph. Here is data from the Maryland Biological Stream Survey on 2180 sampling sites in Maryland streams. The measurement variable is dissolved oxygen concentration, and the nominal variable is the presence or absence of the central stoneroller, Campostoma anomalum. The intervals are different widths so that each interval includes roughly the same number of stream sites. If you use a bar graph to illustrate a logistic regression, you should explain that the grouping was for heuristic purposes only, and the logistic regression was done on the raw, ungrouped data.

Spreadsheets don't have an option to make bars of different widths, so I graphed these data as a scattergraph. The first bar covers the interval from 0.25 to 5.65, and the proportion is 0.018, so the first four rows on the spreadsheet are:

```
0.25      0
0.25  0.018
5.65  0.018
5.65      0

```

These values are connected with a red line, with no symbols. I got the heavy black line for the logistic regression as described above for the amphipod graph. I used a graphics program to paint the bars gray.
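The four-rows-per-bar trick is tedious to type by hand if you have many intervals. Here is a small Python sketch that produces the rows; the function name `bar_outline` is made up for this example, and only the first interval (0.25 to 5.65, proportion 0.018) comes from the stoneroller data. The second interval is a hypothetical placeholder to show how the rows chain together.

```python
def bar_outline(edges, proportions):
    """Return (x, y) rows that trace the outline of variable-width bars.

    edges holds the interval boundaries in increasing order, so it has one
    more entry than proportions.
    """
    rows = []
    for left, right, height in zip(edges[:-1], edges[1:], proportions):
        rows += [(left, 0.0), (left, height), (right, height), (right, 0.0)]
    return rows

# First interval from the text; the second edge and proportion are hypothetical
edges = [0.25, 5.65, 6.50]
proportions = [0.018, 0.050]

for x, y in bar_outline(edges, proportions):
    print(x, y)
```

Pasting the printed rows into two spreadsheet columns and connecting them with a line gives the bar outlines.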

 Proportion of streams with central stonerollers vs. dissolved oxygen. Dissolved oxygen intervals were set to have roughly equal numbers of stream sites. The thick black line is the logistic regression line; it is based on the raw data, not the data grouped into intervals.

### Similar tests

It is possible to do logistic regression with a dependent variable that has more than two values, known as a multinomial, polytomous, or polychotomous variable. This subject is not covered here.

Multiple logistic regression is used when the dependent variable is nominal and there is more than one independent variable. It is analogous to multiple linear regression, and all of the same caveats apply.

Linear regression is used when the Y variable is a measurement variable. For example, if you measured the length of time it took for male beetles to mate and wanted to relate that to stroking speed, you would use linear regression.

When there is just one measurement variable and one nominal variable, one-way anova or a t-test could also be used to compare the means of the measurement variable between the two groups. Conceptually, the difference is whether you think variation in the nominal variable causes variation in the measurement variable (use a t-test) or variation in the measurement variable causes variation in the probability of the nominal variable (use logistic regression). You should also consider who you are presenting your results to, and how they are going to use the information. For example, if you were only interested in stroking speed and mating success in cucumber beetles, you could do a t-test to compare average stroking speed between mated and unmated males. This would be simpler and more familiar than logistic regression; your conclusion would be something like "The mean stroking speed is 73 strokes per minute in mated males and 64 spm in unmated males, a significant difference." Logistic regression is more difficult and less familiar, but you would be able to express your results with statements such as "A male beetle who strokes a female's antennae at 75 strokes per minute is twice as likely to be accepted by the female as one who strokes 61 strokes per minute." This might be easier to understand, and therefore more useful (especially if you're a male cucumber beetle).

### How to do the test

I have written a spreadsheet to do simple logistic regression. You can enter the data either in summarized form (for example, saying that at 30 C there were 7 male and 3 female Komodo dragons) or non-summarized form (for example, entering each Komodo dragon separately, with "0" for a male and "1" for a female). It uses the likelihood-ratio method for calculating the P-value. The spreadsheet makes use of the "Solver" tool in Excel. If you don't see Solver listed in the Tools menu, go to Add-Ins in the Tools menu and install Solver.
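The two data layouts contain exactly the same information, and it can be handy to convert between them. The Python sketch below shows one way to do it, using the Komodo dragon counts mentioned above (7 males and 3 females at 30 C, with "0" for a male and "1" for a female). The function names are my own, and this is not how the spreadsheet itself works.

```python
def expand(summary):
    """Summarized rows (x, number of 0s, number of 1s) -> one (x, outcome) row per individual."""
    rows = []
    for x, n_zero, n_one in summary:
        rows += [(x, 0)] * n_zero + [(x, 1)] * n_one
    return rows

def summarize(rows):
    """One (x, outcome) row per individual -> summarized rows, sorted by x."""
    counts = {}
    for x, outcome in rows:
        zero_one = counts.setdefault(x, [0, 0])
        zero_one[outcome] += 1
    return [(x, c[0], c[1]) for x, c in sorted(counts.items())]

summary = [(30, 7, 3)]          # at 30 C: 7 males (0) and 3 females (1)
individuals = expand(summary)
print(individuals)              # ten rows, one per dragon
print(summarize(individuals))   # back to [(30, 7, 3)]
```

Because logistic regression analyzes each observation separately, either layout should give the same result.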

#### Web page

There is a very nice web page that will do logistic regression, with the likelihood-ratio chi-square. You can enter the data either in summarized form or non-summarized form, with the values separated by tabs (which you'll get if you copy and paste from a spreadsheet) or commas. The amphipod data would be entered like this:

```
48.1,47,139
45.2,177,241
44.0,1087,1183
43.7,187,175
43.5,397,671
37.8,40,14
36.6,39,17
34.3,30,0

```

#### SAS

Use PROC LOGISTIC for simple logistic regression. There are two forms of the MODEL statement. When you have multiple observations for each value of the measurement variable, your data set can have the measurement variable, the number of "successes" (this can be either value of the nominal variable), and the total. Here is an example using the amphipod data:

```
data amphipods;
input location $ latitude mpi90 mpi100;
total=mpi90+mpi100;
cards;
Port_Townsend,_WA    48.1      47    139
Neskowin,_OR         45.2     177    241
Siuslaw_R.,_OR       44.0    1087   1183
Umpqua_R.,_OR        43.7     187    175
Coos_Bay,_OR         43.5     397    671
San_Francisco,_CA    37.8      40     14
Carmel,_CA           36.6      39     17
Santa_Barbara,_CA    34.3      30      0
;
proc logistic data=amphipods;
model mpi100/total=latitude;
run;

```

Note that the new variable TOTAL is created in the DATA step by adding the number of Mpi90 and Mpi100 alleles. The MODEL statement uses the number of Mpi100 alleles out of the total as the dependent variable. The P-value would be the same if you used Mpi90; the equation parameters would be different.

There is a lot of output from PROC LOGISTIC that you don't need. The program gives you three different P-values; the likelihood ratio P-value is the most commonly used:

```
Testing Global Null Hypothesis: BETA=0

Test               Chi-Square     DF   Pr > ChiSq

Likelihood Ratio      83.3007      1       <.0001 P-value
Score                 80.5733      1       <.0001
Wald                  72.0755      1       <.0001

```

The coefficients of the logistic equation are given under "estimate":

```
Analysis of Maximum Likelihood Estimates

                            Standard         Wald
Parameter   DF   Estimate      Error   Chi-Square   Pr > ChiSq

Intercept    1    -7.6469     0.9249      68.3605       <.0001
latitude     1     0.1786     0.0210      72.0755       <.0001

```

Using these coefficients, the maximum likelihood equation for the proportion of Mpi100 alleles at a particular latitude is

$$Y=\frac{e^{-7.6469+0.1786(\text{latitude})}}{1+e^{-7.6469+0.1786(\text{latitude})}}$$

It is also possible to use data in which each line is a single observation. In that case, you may use either words or numbers for the dependent variable. In this example, the data are height (in inches) of the 2004 students of my class, along with their favorite insect (grouped into beetles vs. everything else, where "everything else" includes spiders, which a biologist really should know are not insects):

```
data insect;
input height insect $;
cards;
62  beetle
66  other
61  beetle
67  other
62  other
76  other
66  other
70  beetle
67  other
66  other
70  other
70  other
77  beetle
76  other
72  beetle
76  beetle
72  other
70  other
65  other
63  other
63  other
70  other
72  other
70  beetle
74  other
;
proc logistic data=insect;
model insect=height;
run;

```

Only part of the data set is listed above (rows are omitted after the second observation); see the web page for the full data set.

The format of the results is the same for either form of the MODEL statement. In this case, the model would be the probability of BEETLE, because it is alphabetically first; to model the probability of OTHER, you would add an EVENT after the nominal variable in the MODEL statement, making it MODEL INSECT (EVENT='OTHER')=HEIGHT;

### Further reading

Sokal and Rohlf, pp. 767-778.

### References

Picture of Komodo dragon from National Zoo Asian Animals Gallery.

Picture of fish from Nature Trivia: Four Families of Fishes in Arkansas, by Renn Tumlison.

McDonald, J.H. 1985. Size-related and geographic variation at two enzyme loci in Megalorchestia californiana (Amphipoda: Talitridae). Heredity 54: 359-366.