# Using spreadsheets for statistics

Real statisticians may sneer, but if you're like most biologists, you can do all of your statistics with spreadsheets. You may spend months getting the most technologically sophisticated new biological techniques to work, but in the end your data can be analzyed with a simple chi-squared test, t-test or linear regression. The graphing abilities of spreadsheets make it easy to inspect data for errors and outliers, look for non-linear relationships and non-normal distributions, and display your final results. Even if you're going to use something like SAS or SPSS, there will be many times when it's easier to enter your data into a spreadsheet first, inspect it for errors, sort and arrange it, then export it into a format suitable for your fancy-schmancy statistics package.

The dominant spreadsheet program by far is Excel, part of the Microsoft Office package, available for Windows or Mac. If your computer already has Excel on it, there's no real advantage to trying anything else. Most of the spreadsheets on these web pages were written with a ten-year-old version of Excel, and the instructions on this web page were based on it; if you're using a newer version and notice that something doesn't work right, please drop me a line and tell me about it.

If your computer doesn't have Excel on it, you could use Calc, part of the free, open-source OpenOffice.org package. Calc does almost everything that Excel does, with just enough exceptions to be annoying. Calc will open Excel files and can save files in Excel format. The OpenOffice.org package is available for Windows, Mac, and Linux; Mac users may want to use the NeoOffice port, which looks and feels more like a regular Mac application (but is rather slow). OpenOffice.org also includes a word processor (like Word) and presentation software (like PowerPoint).

For Linux users, Gnumeric sounds like a good, free, open-source spreadsheet program. Because it's a separate program, rather than part of a large office suite, it should be faster than Calc. I haven't used it, so I don't know how well my spreadsheets will work with it.

The instructions on this web page apply to both Calc and Excel, unless otherwise noted.

### Basic spreadsheet tasks

I'm going to assume you know how to enter data into a spreadsheet, copy and paste, insert and delete rows and columns, and other simple tasks. If you're a complete beginner, you may want to look at tutorials on using Excel here, here, or here. Here are a few other things that will be useful for handling data:

#### Separate text into columns

**Excel: **When you copy columns of data from a web page or text document, then paste them into an Excel spreadsheet, all the data will be in one column. To put the data into multiple columns, choose "Text to columns..." from the Data menu. If you choose "Delimited," you can tell it that the columns are separated by spaces, commas, or some other character. Check the "Treat consecutive delimiters as one" box if numbers may be separated by more than one space, more than one tab, etc. If you choose "Fixed width," you can do things like tell it that the first 10 characters go in column 1, the next 7 characters go in column 2, and so on. The data will be entered into the columns to the right of the original column, so make sure they're empty.

If you paste more text into the same spreadsheet, it will automatically be separated into columns using the same delimiters. If you want to turn this off, select the column where you want to paste the data, choose "Text to columns..." from the Data menu, and choose "Delimited." Then unclick all the boxes for delimiters (spaces, commas, etc.) and click "Finish." Now paste your data into the column.

**Calc**: In Calc, when you paste columns of data from a text document or web page into a spreadsheet, you'll get a Text Import window that asks you how to divide the text up. Click on the characters that separate the columns (usually spaces, tabs, or commas) and click on "Merge Delimiters" if columns may be separated by more than one space or tab.

#### Series fill

This is most often used for numbering a bunch of rows. Let's say you have data in cells B1 through E100, and you want to number the rows 1 through 100. Numbering them will help you keep track of which row is which, and it will be especially useful if you want to sort the data, then put them back in their original order. Put a "1" in cell A1, select cells A1-A100, choose "Fill: Series..." from the Edit menu, and you'll put the numbers 1 through 100 in the cells.

#### Sorting

To sort a bunch of data, select the cells and choose "Sort" from the Data menu. You can sort by up to three columns; for example, you could sort data on a bunch of chickens by "Breed" in column A, "Sex" in column B, and "Weight" in column C, and it would sort the data by breeds, then within each breed have all the females first and then all the males, and within each breed/sex combination the chickens would be listed from smallest to largest.

If you've entered a bunch of data, it's a good idea to sort each column of numbers and look at the smallest and largest values. This may help you spot numbers with misplaced decimal points and other egregious typing errors, as they'll be much larger or much smaller than the correct numbers.

#### Graphing

See the web pages on graphing with Excel or graphing with Calc. Drawing some quick graphs is another good way to check your data for weirdness. For example, if you've entered the height and leg length of a bunch of people, draw a quick graph with height on the X axis and leg length on the Y axis. The two variables should be pretty tightly correlated, so if you see some outlier who's 2.10 meters tall and has a leg that's only 0.65 meters long, you know to double-check the data for that person.

#### Absolute and relative cell references

In the formula "=B1+C1", B1 and C1 are relative cell references. If this formula is in cell D1, "B1" means "that cell that is two cells to the left." When you copy cell D1 into cell D2, the formula becomes "=B2+C2"; when you copy it into cell G1, it would become "=E1+F1". This is a great thing about spreadsheets; for example, if you have long columns of numbers in columns A and B and you want to know the sum of each pair, you don't need to type "=Bi+Ci" into each cell in column D, where *i* is the row number; you just type "=B1+C1" once into cell D1, then copy and paste it into all the cells in column D at once.

Sometimes you don't want the cell references to change when you copy a cell; in that case, you should use absolute cell references, indicated with a dollar sign. A dollar sign before the letter means the column won't change, while a dollar sign before the number means the row won't change. If the equation in cell E1 is "=$B1*C$1+$D$1^2" and you copy it into cell F2, the first term would change from $B1 to $B2 (because you've moved down one row), the second term would change from C$1 to D$1 (because you've moved right one column), and the third term, $D$1, wouldn't change (because it has dollar signs before both the letter and the number). So if you had 100 numbers in column B, you could enter "=B1-AVERAGE(B$1:B$100)" in cell C1, copy it into cells C2 through C100, and each value in column B would have the average of the 100 numbers subtracted from it.

#### Paste Special

When a cell has a formula in it (such as "=B1*C1+D1^2"), you see the numerical result of the formula (such as "7.15") in the spreadsheet. If you copy and paste that cell, the formula will be pasted into the new cell; unless the formula only has absolute cell references, it will show a different numerical result. Even if you use only abolute cell references, the result of the formula will change every time you change the values in B1, C1 or D1. When you want to copy and paste the number that results from a function in **Excel**, choose "Paste Special" from the Edit menu and then click the button that says "Values." The number (7.15, in this example) will be pasted into the cell.

In **Calc**, choose "Paste Special" from the Edit menu, uncheck the boxes labelled "Paste All" and "Formulas," and check the box labelled "Numbers."

#### Change number format

To change the number of decimal places that are displayed in a cell in **Excel**, choose "Cells" from the Format menu, then choose the "Number" tab. Under "Category," choose "Number" and tell it how many decimal places you want to display. Note that this only changes the way the number is displayed; all of the digits are still in the cell, they're just invisible.

The default format in **Excel** ("General" format) automatically uses scientific notation for very small or large numbers. If you've changed the format of a cell to "Number" format with a fixed number of decimal places, very small numbers will be rounded to 0. If you see a 0 in a spreadsheet where you expect a non-zero number (such as a P-value), change the format to General.

The default format in **Calc** is a fixed format with only two digits past the decimal point, and Calc doesn't have a format that automatically uses scientific notation for small numbers, which is annoying. One way to get around this is to create a user-defined format. Select the cells you want to fix, choose "Cells" from the Format menu, and paste the following into the box labelled "Format code":

[>0.00001]0.######;[<-0.00001]0.######;0.00E-00

The spreadsheets I've created for these web pages use this format for the cells containing P-values and other results. It will display 0 as 0.00E00, but otherwise it works pretty well.

If a column is too narrow to display a number in the specified format, digits to the right of the decimal point will be rounded. If there are too many digits to the left of the decimal point to display them all, the cell will contain "###". Make sure your columns are wide enough to display all your numbers.

### Useful spreadsheet functions

There are hundreds of functions in Excel and Calc; here are the ones that I find most useful for statistics and general data handling. Note that where the argument (the part in
parentheses) of a function is "*Y*", it means a single number or a single
cell in the spreadsheet. Where the argument says "*Ys*", it means more than
one number or cell. See AVERAGE(Ys) for an example.

All of the examples here are given in Excel format. Calc uses a semicolon instead of a comma to separate multiple parameters; for example, Excel would use "=ROUND(A1, 2)" to return the value in cell A1 rounded to 2 decimal places, while Calc would use "=ROUND(A1; 2)". If you import an Excel file into Calc or export a Calc file to Excel format, Calc automatically converts between commas and semicolons. However, if you type a formula into Calc with a comma instead of a semicolon, Calc acts like it has no idea what you're talking about; all it says is "#NAME?".

I've typed the function names in all capital letters to make them stand out, but you can use lower case letters.

#### Math functions

**ABS(Y)** Returns the absolute value of a number.

**EXP(Y)** Returns *e* to the *y*th power. This is the inverse of LN, meaning that EXP(LN(Y)) equals *Y*.

**LN(Y)** Returns the natural logarithm (logarithm to the base *e*)
of *Y*.

**LOG(Y)** Returns the base-10 logarithm of *Y*. The inverse of LOG is raising 10 to the Yth power, meaning 10^(LOG(Y)) returns *Y*.

**RAND()** Returns a pseudorandom number, equal to or greater than
zero and less than one. You must use empty parentheses so the spreadsheet knows
that RAND is a function. For a pseudorandom number in some other range,
just multiply; thus =RAND()*79 would give you a number greater than or
equal to 0 and less than 79. The value will change every time you enter
something in any cell. One use of random numbers is for randomly assigning individuals to different treatments; you could enter "=RAND()" next to each individual, Copy and Paste Special the random numbers, Sort the individuals based on the column of random numbers, then assign the first 10 individuals to the placebo, the next 10 individuals to 10 mg of the trial drug, etc.

A "pseudorandom" number is generated by a mathematical function; if you started with the same starting number (the "seed"), you'd get the same series of numbers. Excel's pseudorandom number generator bases its seed on the time given by the computer's internal clock, so you won't get the same seed twice. There are problems with Excel's pseudorandom number generator, but the numbers it produces are random enough for anything you're going to do as a biologist.

**ROUND(Y,D)** Returns *Y* rounded to *D* digits. For example,
=ROUND(37.38, 1) returns 37.4, =ROUND(37.38, 0) returns 37, and
=ROUND(37.38, -1) returns 40. Numbers ending in 5 are rounded up (away
from zero), so =ROUND(37.35,1) returns 37.4 and =ROUND(-37.35) returns
-37.4.

**SQRT(Y)** Returns the square root of *Y*.

**SUM(Ys)** Returns the sum of a set of numbers.

#### Logical functions

**AND(logical_test1, logical_test2,...)** Returns TRUE if logical_test1,
logical_test2... are all true, otherwise returns FALSE. As an example, let's
say that cells A1, B1 and C1 all contain numbers, and you want to know
whether they're all greater than 100. One way to find out would be with
the statement =AND(A1>100, B1>100, C1>100), which would return TRUE if
all three were greater than 100 and FALSE if any one were not greater
than 100.

**IF(logical_test, A, B)** Returns *A* if the logical test is true,
*B* if it is false. As an example, let's say you have 1000 rows of data in columns A through E, with a unique ID number in column A, and you want to check for duplicates. Sort the data by column A, so if there are any duplicate ID numbers, they'll be adjacent. Then in cell F1, enter "=IF(A1=A2, "duplicate"," "). This will enter the word "duplicate" if the number in A1 equals the number in A2; otherwise, it will enter a blank space. Then copy this into cells F2 through F999. Now you can quickly scan through the rows and see where the duplicates are.

**ISNUMBER(Y)** Returns TRUE if *Y* is a number, otherwise returns FALSE. This can be useful for identifying cells with missing values. If you want to check the values in cells A1 to A1000 for missing data, you could enter "=IF(ISNUMBER(A1), "OK", "MISSING") into cell B1, copy it into cells B2 to B1000, and then every cell in A1 that didn't contain a number would have "MISSING" next to it in column B.

**OR(logical_test1, logical_test2,...)** Returns TRUE if one or more of logical_test1,
logical_test2... are true, otherwise returns FALSE. As an example, let's
say that cells A1, B1 and C1 all contain numbers, and you want to know
whether any is greater than 100. One way to find out would be with
the statement =OR(A1>100, B1>100, C1>100), which would return TRUE if
one or more were greater than 100 and FALSE if all three were not greater
than 100.

#### Statistical functions

**AVERAGE(Ys)** Returns the arithmetic mean of a set of numbers.
For example, AVERAGE(B1..B17) would give the mean of the numbers in cells
B1..B17, and AVERAGE(7, A1, B1..C17) would give the mean of 7, the number
in cell A1, and the numbers in the cells B1..C17. Note that Excel only
counts those cells that have numbers in them; you could enter
AVERAGE(A1:A100), put numbers in cells A1 to A9, and Excel would
correctly compute the arithmetic mean of those 9 numbers. This is true
for other functions that operate on a range of cells.

**BINOMDIST(S, K, P, cumulative_probability)** Returns the
binomial probability of getting *S* "successes" in *K* trials, under the
hypothesis that the probability of a success is *P*. The argument "cumulative_probability" should be TRUE if you
want the cumulative probability of getting *S* or fewer successes, while it
should be FALSE if you want the probability of getting exactly *S*
successes. (**Calc** uses 1 and 0 instead of TRUE and FALSE.)

**CHIDIST(Y, df)** Returns the probability associated with a
variable, *Y*, that is chi-square distributed with *df* degrees of
freedom. If you use SAS or some other program and it gives the result as "Chi-sq=78.34, 1 d.f., P<0.0001", you can use the CHIDIST function to figure out just how small your P-value is; in this case, "=CHIDIST(78.34, 1)" yields 8.67×10^{-19}.

**CONFIDENCE(alpha, standard-deviation, sample-size)** Returns the confidence interval of a mean, *assuming you know the population
standard deviation.* Because you don't know the population standard deviation, **you should never use this function**; instead, see the web page on confidence intervals for instructions on how to calculate the confidence interval correctly.

**COUNT(Ys)**Counts the number of cells in a range that contain
numbers; if you've entered data into cells A1 through A9, A11, and A17, "=count(A1:A100)" will yield 11.

**DEVSQ(Ys)**Returns the sum of squares of deviations of data
points from the mean. This is what statisticians refer to as the "sum of
squares."

**FDIST(Y, df1, df2)** Returns the probability value associated
with a variable, *Y*, that is F-distributed with *df1* degrees of freedom in
the numerator and *df2* degrees of freedom in the denominator. If you use SAS or some other program and it gives the result as "F=78.34, 1, 19 d.f., P<0.0001", you can use the FDIST function to figure out just how small your P-value is; in this case, "=FDIST(78.34, 1, 19)" yields 3.62×10^{-8}.

**MEDIAN(Ys)** Returns the median of a set of numbers. If the
sample size is even, this returns the mean of the two middle numbers.

**MIN(Ys)** Returns the minimum of a set of numbers. Useful for
finding the range, which is MAX(Ys)-MIN(Ys).

**MAX(Ys)** Returns the maximum of a set of numbers.

**RANK(X, Ys, type)** Returns the rank of *X* in the set of Ys. If *type* is set to 0, the largest number has a rank of 1; if *type* is set to 1, the smallest number has a rank of 0. For example, if cells A1:A8 contain the numbers 10, 12, 14, 14, 16, 17, 20, 21, "=RANK(A2, A1:A8, 0)" returns 7 (the number 12 is the 7th largest in that list), and "=RANK(A2, A1:A8, 1)" returns 2 (it's the 2nd smallest).

Spreadsheets give tied ranks the smallest rank; both of the 14's in the above list would get a rank of 5, as they are tied for 5th largest. The nonparametric tests used in statistics require that ties be given the average rank; both of the 14's in the above list should get a rank of 5.5, the average of 5 and 6, as they are the 5th and 6th largest. This formula shows how to get ranks with ties handled correctly:

=AVERAGE(RANK(A1, A1:A8, 0), 1+COUNT(A1:A8)-RANK(A1, A1:A8, 1))

**STDEV(Ys)** Returns an estimate of the standard deviation based
on a population sample. This is the function you should use for standard deviation.

**STDEVP(Ys)** Returns the standard deviation of values from an
entire population, not just a sample. **You should never use this
function**.

**SUM(Ys)** Returns the sum of the Ys.

**SUMSQ(Ys)** Returns the sum of the squared values. Note that
statisticians use "sum of squares" as a shorthand term for the sum of the
squared deviations from the mean. SUMSQ does not give you the sum of
squares in this statistical sense; for the statistical sum of squares,
use DEVSQ. You will probably never use SUMSQ.

**TDIST(Y, df, tails)** Returns the probability value associated
with a variable, *Y*, that is t-distributed with *d*f degrees of freedom and *tails* equal to one or two (you'll almost always want the two-tailed test). If you use SAS or some other program and it gives the result as "t=78.34, 19 d.f., P<0.0001", you can use the TDIST function to figure out just how small your P-value is; in this case, "=TDIST(78.34, 19, 2)" yields 2.56×10^{-25}.

**VAR(Ys)** Returns an estimate of the variance based on a
population sample. This is the function you should use for variance.

**VARP(Ys)** Returns the variance of values from an entire
population, not just a sample. **You should never use this
function**.

### Reference

### ⇐ Previous topic | Next topic ⇒

This page was last revised September 14, 2009. Its URL is
http://udel.edu/~mcdonald/statspreadsheet.html. It may be cited as pp. 266-273 in: McDonald, J.H. 2009. Handbook of Biological Statistics (2nd ed.). Sparky House Publishing, Baltimore, Maryland.

©2009 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.