BISC656 Homework 2

Estimating phylogeny

Due Friday, March 7

For this homework, you'll estimate the phylogeny of some of the major taxa of mammals. You'll start by downloading sequences of a mitochondrial protein from Genbank. Then you'll align the sequences using ClustalW. You'll use the Phylip programs to estimate the phylogeny using two different techniques, neighbor joining and maximum parsimony. You may do bootstrap analysis to estimate the confidence you should have in each clade, and you'll write a few sentences comparing the trees.

Students who took my section of BISC413 last semester did a similar lab and are already experts at this. Those students are therefore required to do the extra work described at the end. Other students can do the extra work if they feel like a challenge, but it won't affect their grade.

Obtain sequences

1. Choose your protein. You will use the protein corresponding to the month you were born:

Obtain the sequences. Go to the Entrez Protein database. In the search box, enter the name of your protein and the name of one of the species listed below. Just enter the species name, such as "Monodelphis domestica". I've picked one species from most of the orders of placental mammals; for some of the interesting orders, I've picked multiple families. The marsupial is an outgroup.

  1. outgroup: Metatheria: Monodelphis domestica (gray short-tailed opossum)
  2. Cetacea: Balaena mysticetus (bowhead whale)
  3. Chiroptera: Chalinolobus tuberculatus (New Zealand long-tailed bat)
  4. Cynocephalidae: Cynocephalus variegatus (Malayan flying lemur)
  5. Dasypodidae: Dasypus novemcinctus (nine-banded armadillo)
  6. Dugongidae: Dugong dugon (dugong)
  7. Fissipedia: Canis familiaris (dog)
  8. Hippopotamidae: Hippopotamus amphibius (hippopotamus)
  9. Hyracoidea: Procavia capensis (cape rock hyrax)
  10. Insectivora: Erinaceus europaeus (European hedgehog)
  11. Lagomorpha: Oryctolagus cuniculus (rabbit)
  12. Perissodactyla: Equus caballus (horse)
  13. Pinnipedia: Phoca vitulina (harbor seal)
  14. Primates: Homo sapiens (humans)
  15. Proboscidea: Elephas maximus (Asiatic elephant)
  16. Rodentia: Mus musculus (house mouse)
  17. Ruminantia: Bos taurus (cattle)
  18. Scandentia: Tupaia belangeri (northern tree shrew)
  19. Suina: Sus scrofa (pig)
  20. Tubulidentata: Orycteropus afer (aardvark)
  21. Tylopoda: Lama pacos (alpaca)

Separate the protein name and the species name with the word "AND", like this: NADH dehydrogenase subunit 1 AND homo sapiens, then click "Go." You should get a list of one or more protein sequences. Click on the link for the first one. You'll see the entry for that protein in "GenPept" format. Use the pull-down menu in the upper left labelled "Display" to change it to "FASTA" format. FASTA is a much simpler format, with just one line of descriptive information followed by the sequence. Copy the FASTA-formatted sequence into a file, either a plain text file (NotePad on Windows or TeachText on Mac) or a word processor document.

Repeat the above process for each species in the list. The end result should be a file containing 21 protein sequences. I think each of the above species has a sequence in the database for each of the proteins, but I'm not sure. If your protein is not found in a species, search for the protein in that family or order. For example, if cytochrome b is not found in the alpaca, search for "tylopoda and cytochrome b", and note which species you're now using.

Modify the sequence names. The next program you'll use, ClustalW, only uses the first 30 characters of the sequence name, which is a bunch of numbers. Put the species name immediately after the ">", changing ">gi|17981857|ref|NP_536847.1| ATP synthase F0 subunit 8 [Homo sapiens]" to ">Homo_sapiens gi|17981857|ref|NP_536847.1| ATP synthase F0 subunit 8". You can use either the scientific names or the common names (human, aardvark, alpaca, etc.), whichever you prefer. Don't put spaces in the name, as Phylip will only read up to the first space.

Your set of sequences should look something like this:

>opossum
MTNLRKNYPLMKIINHSFIDLPAPSNISAWWNFGSLLGMCLIIQILTGLF
LAMHYTSDTLTAFSSVAHICRDVNYGWLIRNLHANGASMFFMCLFLHVGR
>hippo
MTNIRKSHPLMKIINDAFVDLPAPSNISSWWNFGSLLGVCLILQILTGLF
LAMHYTPDTLTAFSSVTHICRDVNYGWVIRYMHANGASIFFICLFTHVGR
  .
  .
  .
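If you'd rather not edit 21 name lines by hand, the renaming can be scripted. The following is a minimal Python sketch, assuming each name line ends with the species in square brackets, as in the Genbank example above. The function names rename_header and rename_fasta are made up for this illustration; they are not part of ClustalW or Phylip.

```python
import re


def rename_header(header):
    """Move the bracketed species name to the front of a FASTA name line."""
    match = re.search(r"\[([^\]]+)\]\s*$", header)
    if not header.startswith(">") or match is None:
        return header  # leave anything unexpected untouched
    # Phylip reads only up to the first space, so use underscores.
    species = match.group(1).strip().replace(" ", "_")
    rest = header[1:match.start()].strip()
    return f">{species} {rest}"


def rename_fasta(text):
    """Apply rename_header to every name line; sequence lines pass through."""
    return "\n".join(
        rename_header(line) if line.startswith(">") else line
        for line in text.splitlines()
    )
```

Sequence lines and any name line without a bracketed species are left alone, so check the result by eye before pasting it into ClustalW.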

Align the sequences. The next task is to align the protein sequences, putting dashes in them so that matching amino acids are at the same position in each sequence. You'll use the program ClustalW. Copy your set of FASTA-format protein sequences and paste it into the box on that web page, then hit the "Run" button. It may take a few minutes to align the sequences. Once the program is done, scroll down to look at the aligned sequences. If there's something about it that doesn't make sense, ask me for help. This default format uses stars to indicate sites that are identical in all the sequences; this makes it easier to see how well the sequences match.

If you get an error message saying that zero sequences were read, your input sequences are probably in the wrong format. Make sure that the first character on each name line is a ">", with no space after it, then the name. Make sure there's a line return after each name line; hit "return" after each name (it doesn't hurt to have a blank line between the name line and the sequence). Make sure there's a line return at the end of each sequence, before the next name line.
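The checks in the paragraph above can also be run automatically before you paste anything into ClustalW. This is a rough Python sketch of those checks only; check_fasta is a hypothetical helper written for this page, and it does not reproduce whatever ClustalW itself tests.

```python
def check_fasta(text):
    """Return a list of formatting problems found in FASTA-format text."""
    problems = []
    current = None  # name of the record being read
    residues = 0    # sequence characters seen for that record
    count = 0       # number of name lines seen
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are harmless
        if line.lstrip().startswith(">"):
            if current is not None and residues == 0:
                problems.append(f"no sequence after name '{current}'")
            if not line.startswith(">"):
                problems.append(f"line {number}: '>' is not the first character")
            name = line.lstrip()[1:]
            if not name or name[0].isspace():
                problems.append(f"line {number}: space or nothing right after '>'")
            current = name.split()[0] if name.split() else ""
            residues = 0
            count += 1
        elif current is None:
            problems.append(f"line {number}: sequence found before any name line")
        else:
            residues += len(line.strip())
    if current is not None and residues == 0:
        problems.append(f"no sequence after name '{current}'")
    if count == 0:
        problems.append("no name lines found")
    return problems
```

An empty list means none of these particular problems were found; it does not guarantee that the sequences themselves are right.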

Once you have a reasonable-looking alignment, run ClustalW again, only this time save the output in Phylip format. Go back to the ClustalW page and change the "Output Format" parameter to "phylip". Hit "Run" again, and once you get the output, save the Phylip-formatted file. Give it a more informative name, something like "cytochrome.aln" instead of "clustalw-20050927-17530977.aln". If you copy the Phylip-formatted alignment from the web page to a text editor, make sure you include the two numbers at the beginning of the alignment; they tell Phylip how many taxa and how many characters the alignment contains.

There are lots of different techniques that are used to estimate phylogenies from protein or DNA data, and the advocates of each technique can be rather passionate about its superiority. This can make molecular systematics intimidating for the novice. Today you'll analyze your data using two different techniques, neighbor joining and parsimony.

Download Phylip. Phylip is a package of programs that will perform many of the more popular techniques of estimating phylogenies. It is available for free for just about any kind of computer. Go to the Phylip home page and download the appropriate files for the computer you're using, then go to the Phylip installation instructions and follow the instructions there on how to install the programs.

Look for the file named "font1" in the "exe" directory, and rename it "fontfile".

Estimate the phylogeny using neighbor-joining. Put your Phylip-formatted file of sequences, that you created using ClustalW, in the "exe" directory. Find the "protdist" application and start it. It will ask you for a file name; enter the name of your sequence file (including the extension, such as ".aln" or ".txt"). Next you'll see a bunch of settings; accept them all. The output will be a matrix of distances between your sequences, in a file called "outfile". It's important to change the name of "outfile" to something else (like "cytochromedistances") every time you run a Phylip program, because otherwise the next program you use will overwrite it.
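To get a feel for what a distance matrix is, here is a small Python sketch. Note that it is not what Protdist computes: Protdist bases its distances on a model of amino acid substitution, while this sketch uses the much simpler proportion of sites that differ (the "p-distance"), skipping any site where either sequence has a gap. The function names are invented for this example.

```python
def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    # Ignore sites where either sequence has an alignment gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("the sequences share no ungapped sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Map every ordered pair of names to the p-distance between them."""
    names = list(alignment)
    return {
        (first, second): p_distance(alignment[first], alignment[second])
        for first in names
        for second in names
    }
```

For the first ten amino acids of the opossum and hippo sequences shown earlier (MTNLRKNYPL and MTNIRKSHPL), three sites differ, so p_distance returns 0.3.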

If you get an error message saying that the program could not find the file whose name you typed in, make sure you're typing the extension (such as ".txt" or ".aln") that is part of the file name. To see the extension if it is hidden, choose Properties (in Windows) or Get Info (on a Mac).

If you get an error message saying "the function asked for an inappropriate amount of memory", it has nothing to do with memory; it means your input file of sequences is in the wrong format. Make sure it's in Phylip format, like this:

    21    382
hippopotam MTNIRKSHPL MKIINDAFVD LPAPSNISSW WNFGSLLGVC LILQILTGLF
alpaca     MTNIRKSHPL LKIVNNAFID LPAPSNISSW WNFGSLLGIC LIMQIMTGLF
cattle     MTNIRKSHPL MKIVNNAFID LPAPSNISSW WNFGSLLGIC LILQILTGLF

If you copied the alignment from the web page to a word processor such as Microsoft Word, make sure you saved it as "text only" format. And if you created the alignment on a Mac and moved it to a Windows computer, or vice versa, you may have problems (they use different hidden symbols to indicate line breaks).
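A quick way to catch these format problems is to check the file yourself before giving it to Protdist. The Python sketch below assumes the interleaved layout shown above: a first line with two numbers, then blocks of lines with one line per taxon, and names in the first 10 columns of the first block only. It treats Windows and old Mac line breaks the same as Unix ones. The name check_phylip is hypothetical; this is not how Phylip itself reads files.

```python
def check_phylip(text):
    """Return a list of problems found in a Phylip-format alignment."""
    # Treat Windows (\r\n) and old Mac (\r) line breaks like Unix ones.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return ["the file is empty"]
    header = lines[0].split()
    if len(header) != 2 or not all(field.isdigit() for field in header):
        return ["the first line should hold two numbers: taxa and characters"]
    n_taxa, n_chars = int(header[0]), int(header[1])
    body = lines[1:]
    if n_taxa == 0 or not body or len(body) % n_taxa != 0:
        return [f"expected the sequence lines to come in blocks of {n_taxa}"]
    lengths = [0] * n_taxa
    for index, line in enumerate(body):
        # Only the first block carries names, in the first 10 columns.
        data = line[10:] if index < n_taxa else line
        lengths[index % n_taxa] += len("".join(data.split()))
    problems = []
    for index, length in enumerate(lengths):
        if length != n_chars:
            name = body[index][:10].strip()
            problems.append(f"{name}: {length} characters, header says {n_chars}")
    return problems
```

If the two numbers at the top were left out when copying from the web page, the first check reports it, since the first line will then be a sequence line instead.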

Next, start the program "neighbor". Enter the name of the outfile you just created and renamed. Then enter "o" to change the outgroup to the species you decided on; for example, if the opossum sequence is the twelfth one in the file, you'll enter "o" and then "12". (See where the opossum is in the Phylip-formatted alignment file, not the original set of sequences you input into ClustalW). Just accept all the other settings. Once the program says "Output written to file 'outfile'", you can quit this program. Change the name of "outfile" and "outtree" to something else, like "cytochromeneighboroutfile" and "cytochromeneighborouttree".
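If counting down the alignment by eye is error-prone, the position of the outgroup can be looked up with a few lines of Python. This sketch assumes the Phylip layout shown earlier, with the number of taxa first on the top line and each name in the first 10 columns of the lines that follow. The function name taxon_position is made up for this example.

```python
def taxon_position(phylip_text, name):
    """Return the 1-based position of a taxon in a Phylip-format alignment."""
    lines = [line for line in phylip_text.splitlines() if line.strip()]
    n_taxa = int(lines[0].split()[0])
    # Names occupy the first 10 columns of the first block of lines.
    names = [line[:10].strip() for line in lines[1:1 + n_taxa]]
    wanted = name[:10].strip().lower()  # names are cut off at 10 characters
    for position, taxon in enumerate(names, start=1):
        if taxon.lower() == wanted:
            return position
    raise ValueError(f"{name!r} is not among the taxa: {names}")
```

The number it returns is the one to type after "o" in Neighbor (and later in Protpars).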

Estimate the phylogeny using parsimony. Run the program "protpars" just like you did "neighbor", except use your Phylip-formatted sequence file as input. Tell it which taxon to use as the outgroup, and leave the other parameters unchanged. When it's done running, make sure you give "outfile" and "outtree" different names.

Draw your trees. Phylip's tree-drawing program is rather clumsy, so we'll use the program NJplot, available here. Download the appropriate version for your computer.

Start the program NJplot, and tell it to open your neighbor-joining tree (which you named something like "cytochromeneighborouttree"). If the correct species is not the outgroup, click on the "New outgroup" button and click on the "#" next to the species you want to be the outgroup. Print this tree out (or copy and paste it to a word processor document).

Use NJplot to display your parsimony tree, and print it as well. You may want to play with the "Swap nodes" function to get the two trees to have the species in the same order, which will make it easier to see similarities and differences between the two trees.

Extra work

If you were in my section of BISC413 last semester, all the work so far is pretty familiar. You are therefore required to do bootstrap analysis of your trees. First, use the Seqboot program to create 100 data sets. Use your original Phylip-formatted alignment as the input; you don't need to change any of the parameters.
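The idea behind the 100 data sets is bootstrap resampling: each new data set is built by drawing alignment columns at random, with replacement, until it is as long as the original. The Python sketch below illustrates that idea under the assumption that the alignment is held as a dictionary of equal-length strings. It is not Seqboot's code, and the function names and the seed parameter are inventions for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one data set by sampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    # The same random columns are used for every taxon, so sites stay aligned.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(sequence[column] for column in columns)
        for name, sequence in alignment.items()
    }


def bootstrap_data_sets(alignment, replicates=100, seed=1):
    """Return a list of resampled alignments, one per replicate."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(replicates)]
```

Some columns will appear several times in a given data set and others not at all, which is why each data set can give a slightly different tree.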

Next, use Protdist to get the distances for the 100 data sets in your Seqboot output. Enter "M" to tell it to analyze multiple data sets.

Next, use Neighbor on the output from Protdist. Use the "O" command to tell it which species is the outgroup, use the "J" command to tell it to randomize the input order, and use the "M" command to tell it to analyze multiple data sets.

Next, run Consense on the outtree file. Use "O" to tell it which species is the outgroup, and use "R" to tell it to treat the tree as rooted.
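The bootstrap number for a clade is, roughly, how often that clade shows up among the trees from the resampled data sets. The Python sketch below shows just that tally, assuming each tree has already been reduced to a collection of clades and each clade is written as a frozenset of species names. It does not read Phylip tree files or build a consensus tree the way Consense does; clade_support is a hypothetical name.

```python
from collections import Counter


def clade_support(trees):
    """Return the fraction of trees in which each clade appears."""
    if not trees:
        raise ValueError("need at least one tree")
    # set(tree) makes sure a clade is counted at most once per tree.
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: counts[clade] / len(trees) for clade in counts}
```

For example, if a dog plus seal clade appears in three of four trees, its support comes out as 0.75.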

Open the outtree from Consense in NJplot. To display the bootstrap numbers, don't do the obvious thing, which is to click on the "bootstrap values" button; it won't work. Instead, click on the "branch lengths" button, and you'll see, not branch lengths, but bootstrap values.

Run Protpars on the output from Seqboot, with the same options you used for Neighbor, then run Consense.

If you weren't in my section of BISC413, you can do the bootstrap analysis if you want an extra challenge. You won't get extra credit if you do it right, and you won't get points off if you try and don't get the bootstrap right.

To turn in:

On Friday, March 7, turn in your two trees and about a half page of typed discussion. In your discussion, describe any problems you had (such as not finding your gene in a species and having to substitute a different species), and describe the similarities and differences between the trees. Then use the trees to discuss this question: how many times have flippers (found in whales, seals, and dugongs) evolved in mammals? What might the ancestors of flipper-bearing mammals have looked like?



This page was last revised March 1, 2007. Its URL is http://udel.edu/~mcdonald/evolhw2.html