# Step-by-step analysis of biological data

I find that a systematic, step-by-step approach is the best way to analyze biological data. The statistical analysis of a biological experiment may be broken down into the following steps:

- Specify the biological question to be answered.
- Put the question in the form of a biological null hypothesis and alternative hypothesis.
- Put the question in the form of a statistical null hypothesis and alternative hypothesis.
- Determine which variables are relevant to the question.
- Determine what kind of variable each one is.
- Design an experiment that controls or randomizes the confounding variables.
- Based on the number of variables, the kind of variables, the expected fit to the parametric assumptions, and the hypothesis to be tested, choose the best statistical test to use.
- If possible, do a power analysis to determine a good sample size for the experiment.
- Do the experiment.
- Examine the data to see if it meets the assumptions of the statistical test you chose (normality, homoscedasticity, etc.). If it doesn't, choose a more appropriate test.
- Apply the chosen statistical test, and interpret the result.
- Communicate your results effectively, usually with a graph or table.
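As a rough illustration of the power-analysis step, the sketch below uses the common normal-approximation formula for comparing the means of two groups. The function name `sample_size_two_groups` and the numbers are made up for illustration. The approximation gives a slightly smaller sample size than a *t*-based calculation would, and a design with more than two groups would need a different formula or a dedicated power-analysis program.

```python
import math
from statistics import NormalDist


def sample_size_two_groups(sd, effect, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two means.

    Uses n = 2 * ((z_alpha + z_power) * sd / effect) ** 2, where
    z_alpha is the two-tailed critical value of the standard normal
    distribution and z_power is the quantile for the desired power.
    """
    z = NormalDist()  # standard normal distribution (mean 0, sd 1)
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)  # round up to a whole number of individuals


# Hypothetical values: standard deviation of 10 units, and a
# difference between means of 10 units that we want to detect.
print(sample_size_two_groups(sd=10.0, effect=10.0))  # prints 16
```

Halving the effect size roughly quadruples the required sample size, which is why the choice of effect size matters so much when planning an experiment.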

Here's an example of how this works. Verrelli and Eanes (2001) measured glycogen content in *Drosophila melanogaster* individuals. The flies were polymorphic at the genetic locus that codes for the enzyme phosphoglucomutase (PGM). At site 52 in the PGM protein sequence, flies had either a valine or an alanine. At site 484, they had either a valine or a leucine. All four combinations of amino acids (V-V, V-L, A-V, A-L) were present.

- One biological question is "Do the amino acid polymorphisms at the *Pgm* locus have an effect on glycogen content?" The biological question is usually something about biological processes, usually in the form "Does X cause Y?"
- The biological null hypothesis is "Different amino acid sequences do not affect the biochemical properties of PGM, so glycogen content is not affected by PGM sequence." The biological alternative hypothesis is "Different amino acid sequences do affect the biochemical properties of PGM, so glycogen content is affected by PGM sequence."
- The statistical null hypothesis is "Flies with different sequences of the PGM enzyme have the same average glycogen content." The statistical alternative hypothesis is "Flies with different sequences of PGM have different average glycogen contents." While the biological null and alternative hypotheses are about biological processes, the statistical null and alternative hypotheses are all about the numbers; in this case, the glycogen contents are either the same or different.
- The two relevant variables are glycogen content and PGM sequence.
- Glycogen content is a measurement variable, something that is recorded as a number that could have many possible values. The sequence of PGM that a fly has (V-V, V-L, A-V or A-L) is a nominal variable, something with a small number of possible values (four, in this case) that is usually recorded as a word.
- Other variables that might be important, such as age and where in a vial the fly pupated, were either controlled (flies of all the same age were used) or randomized (flies were taken randomly from the vials without regard to where they pupated).
- Because the goal is to compare the means of one measurement variable among groups classified by one nominal variable, and there are more than two classes, the appropriate statistical test is a Model I one-way anova.
- A power analysis would have required an estimate of the standard deviation of glycogen content, which probably could have been found in the published literature, and a number for the effect size (the variation in glycogen content among genotypes that the experimenters wanted to detect). In this experiment, any difference in glycogen content among genotypes would be interesting, so the experimenters just used as many flies as was practical in the time available.
- The experiment was done: glycogen content was measured in flies with different PGM sequences.
- The anova assumes that the measurement variable, glycogen content, is normal (the distribution fits the bell-shaped normal curve) and homoscedastic (the variances in glycogen content are equal among flies with the different PGM sequences). Inspecting histograms of the data shows that the data fit these assumptions. If the data hadn't met the assumptions of anova, the Kruskal–Wallis test or Welch's test might have been better.
- The one-way anova was done, using a spreadsheet, web page, or computer program, and the result of the anova is a P-value less than 0.05. The interpretation is that flies with some PGM sequences have different average glycogen content than flies with other sequences of PGM.
- The results could be summarized in a table, but a more effective way to communicate them is with a graph:

Glycogen content in *Drosophila melanogaster*. Each bar represents the mean glycogen content (in micrograms per fly) of 12 flies with the indicated PGM haplotype. Narrow bars represent +/-2 standard errors of the mean.
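To make the last few steps concrete, here is a minimal Python sketch that summarizes each group as a mean and standard error (the quantities plotted in a graph like this one) and computes the F-statistic of a one-way anova by hand. The glycogen values are invented for illustration and are not the data of Verrelli and Eanes; the function names `summarize` and `one_way_anova` are likewise hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Invented glycogen contents (micrograms per fly), for illustration only.
# These are NOT the measurements of Verrelli and Eanes (2001).
glycogen = {
    "V-V": [78.0, 82.5, 80.1, 85.3, 79.4],
    "V-L": [74.2, 77.8, 72.9, 76.0, 75.5],
    "A-V": [81.7, 84.0, 86.2, 83.1, 88.4],
    "A-L": [79.9, 76.3, 81.2, 78.8, 80.5],
}


def summarize(values):
    """Return (mean, standard error of the mean) for one group."""
    return mean(values), stdev(values) / sqrt(len(values))


def one_way_anova(groups):
    """Return (F, df_among, df_within) for a list of groups of measurements."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_among = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_among = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_among / df_among) / (ss_within / df_within)
    return f_stat, df_among, df_within


for haplotype, values in glycogen.items():
    m, se = summarize(values)
    # Error bars on the graph would span the mean plus or minus 2 standard errors
    print(f"{haplotype}: mean = {m:.1f}, +/-2 SE = {2 * se:.1f}")

f_stat, df_among, df_within = one_way_anova(list(glycogen.values()))
print(f"F = {f_stat:.2f} with {df_among} and {df_within} d.f.")
```

The P-value is the upper-tail probability of this F under an F-distribution with those two degrees of freedom. A statistics package would supply it, for example `scipy.stats.f.sf(f_stat, df_among, df_within)` if SciPy is available.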

### References

Verrelli, B.C., and W.F. Eanes. 2001. The functional impact of PGM amino acid polymorphism on glycogen content in *Drosophila melanogaster*. Genetics 159: 201-210. (Note that for the purposes of this web page, I've used a different statistical test than Verrelli and Eanes did. They were interested in interactions among the individual amino acid polymorphisms, so they used a two-way anova.)


This page was last revised September 14, 2009. Its address is http://udel.edu/~mcdonald/statstep.html. It may be cited as pp. 4-6 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.

©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.