BISC413 Lab 7, Sept. 22: Clustal and Phylip

Today you'll get more practice with BLAST. Then you'll use two tools for analyzing DNA and protein sequences. First, you'll align the protein sequences with ClustalW. Then you'll use Phylip to estimate the phylogeny of the organisms you're analyzing. Finally, you'll look at the aligned sequences and assign substitutions to different branches of the phylogeny.

Find sequences with BLAST

Some proteins are highly constrained, meaning that almost all mutations in their genes are harmful. They therefore evolve very slowly (the classic example is histone H4, with only two amino acid differences between peas and cows). Other protein genes evolve very quickly, either because many mutations are neutral (have no effect on fitness) or because some mutations are adaptive (increase fitness).

For today's lab, you'll analyze the gene you worked on in last Tuesday's lab, either the gene for your fly mutant or the gene next to it. For the purpose of learning to do the analysis, I want you to have a data set where there are some amino acid differences between the species, but not a confusingly large number. The first step will therefore be to pick a group of species that are closely enough related that most of the amino acid sites are unchanged, but distant enough that the protein sequences aren't identical.

BLAST your protein sequence from D. melanogaster against Homo sapiens. Scroll down past the list of hits until you see the individual reports on genes, and look at the first one. Find the part that looks like this:


 Score =  787 bits (2032),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 377/551 (68%), Positives = 443/551 (80%), Gaps = 2/551 (0%)

If the "Identities" number is 90% or more, you have a slowly evolving gene; more than 90% of the amino acid sites are identical in flies and humans. You'll analyze taxon set A.


Taxon set A:
Homo sapiens (outgroup)
Ixodes scapularis
Drosophila melanogaster
Culex quinquefasciatus
Tribolium castaneum
Apis mellifera
Pediculus humanus corporis
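
If you'd rather not eyeball the percentage, the "Identities" field can be pulled out of the report line with a few lines of Python. This is only a sketch: the function name parse_identities is made up for this example, and it assumes the line is formatted like the sample shown above.

```python
import re

# Matches the "Identities = 377/551 (68%)" part of a BLAST pairwise report.
IDENTITIES_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")

def parse_identities(report_line):
    """Return (identical_sites, aligned_length, percent_identity)."""
    m = IDENTITIES_RE.search(report_line)
    if m is None:
        raise ValueError("no Identities field found in this line")
    matches, length = int(m.group(1)), int(m.group(2))
    # Recompute the percentage instead of trusting the rounded value in parentheses.
    return matches, length, 100.0 * matches / length

line = " Identities = 377/551 (68%), Positives = 443/551 (80%), Gaps = 2/551 (0%)"
matches, length, percent = parse_identities(line)
print(f"{matches}/{length} = {percent:.1f}% identical")
```

For the sample line, the identity is well under 90%, so that gene would not use taxon set A.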

If the fly-human identity was less than 90% (or you know from last week that there's no good fly-human match), BLAST your fly sequence against the human louse Pediculus humanus corporis. If the percent identity for the first fly-louse match is 90% or more, you'll analyze taxon set B.


Taxon set B:
Pediculus humanus corporis (outgroup)
Apis mellifera
Nasonia vitripennis
Tribolium castaneum
Aedes aegypti 
Culex quinquefasciatus
Drosophila melanogaster

If the fly-louse identity was less than 90%, BLAST your fly sequence against the bee Apis mellifera. If the percent identity for the first fly-bee match is 90% or more, you'll analyze taxon set C.


Taxon set C:
Apis mellifera (outgroup)
Aedes aegypti 
Anopheles gambiae
Culex quinquefasciatus
Drosophila melanogaster
Drosophila pseudoobscura
Drosophila virilis

If the fly-bee identity was less than 90%, BLAST your fly sequence against the fly Drosophila virilis. If the percent identity for the first D. melanogaster-D. virilis match is 90% or more, you'll analyze taxon set D:


Taxon set D: 
Drosophila virilis (outgroup)
Drosophila pseudoobscura
Drosophila willistoni
Drosophila ananassae
Drosophila melanogaster
Drosophila yakuba
Drosophila simulans

If the D. melanogaster-D. virilis identity was less than 90%, you'll use taxon set E:


Taxon set E: 
Drosophila ananassae (outgroup)
Drosophila melanogaster
Drosophila erecta
Drosophila sechellia
Drosophila yakuba
Drosophila simulans 

plus one of the following (whichever has the highest identity with D. melanogaster):
Drosophila mauritiana
Drosophila orena
Drosophila santomea
Drosophila teissieri

If one of the species in your taxon set doesn't have a good match with D. melanogaster (a "good match" for our purposes being an identity greater than 90%), add a species from the taxon set below it.
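
The decision rule above is a simple cascade: try human, then louse, then bee, then D. virilis, and stop at the first comparison with at least 90% identity. A minimal Python sketch of that rule follows. The function name choose_taxon_set and the dictionary layout are inventions for this example, and a comparison left out of the dictionary is treated as "no good match".

```python
def choose_taxon_set(identity_percent):
    """Pick a taxon set from percent identities of the first BLAST match.

    identity_percent maps a species name to the percent identity between
    its first match and the D. melanogaster protein. A species missing
    from the dictionary is treated as having no good match.
    """
    cascade = [
        ("Homo sapiens", "A"),
        ("Pediculus humanus corporis", "B"),
        ("Apis mellifera", "C"),
        ("Drosophila virilis", "D"),
    ]
    for species, taxon_set in cascade:
        if identity_percent.get(species, 0.0) >= 90.0:
            return taxon_set
    # Nothing reached 90%, even D. virilis: use the closest relatives.
    return "E"

# Made-up percentages, for illustration only.
example = {
    "Homo sapiens": 68.4,
    "Pediculus humanus corporis": 81.0,
    "Apis mellifera": 92.5,
}
print(choose_taxon_set(example))
```

With these made-up numbers the bee is the first comparison to reach 90%, so the function returns "C".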

Once you've figured out which taxon set you're using, BLAST your fly sequence against each of the other six species in your taxon set. For the first match for each species, click on the link to the left of the name. This will give you a view with a lot of information about the gene. In the upper left of the page, where it says "Format:", click on "FASTA". This will give you the protein sequence in FASTA format. Copy this sequence into a file. The first line will be some identifying information, like:


>gi|189238|gb|AAA36368.1| neuroleukin

Change it to an informative one-word identifier, like ">human". Make sure there's a return after the first line (insert the cursor after the word and hit "return").

Repeat this process for each of the species in your taxon set. You should end up with a set of seven sequences in FASTA format, all in one file. Be sure to include the D. melanogaster sequence in the file. Put the sequence from the outgroup at the top of the file.
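
Editing the header lines by hand works fine, but the same relabeling can be sketched in Python if you prefer. Everything here is illustrative: the function names are made up, and the short sequences in the example are placeholders, not real proteins.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs, in file order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]

def relabel(records, labels):
    """Replace each header with a one-word label, keeping the sequence order."""
    if len(records) != len(labels):
        raise ValueError("need exactly one label per sequence")
    for label in labels:
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"label {label!r} is not a single word")
    return [(label, seq) for label, (_, seq) in zip(labels, records)]

def write_fasta(records, width=60):
    """Format records as FASTA text, with a line break after every header."""
    lines = []
    for label, seq in records:
        lines.append(">" + label)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Placeholder sequences; the first header is the sample shown above.
raw = """>gi|189238|gb|AAA36368.1| neuroleukin
MAALTRDPQF
QKLQQWYREH
>some long fly header
MAGPLTQNPE
"""
records = relabel(read_fasta(raw), ["human", "Dmel"])
print(write_fasta(records))
```

For the real file you would pass seven labels, with the outgroup's label first to match the order of the sequences.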

Align the sequences with ClustalW

During the evolution of protein genes, some of the changes that occur are insertions and deletions of bits of DNA. This means that aligning the homologous sites in the protein sequences requires putting gaps in some of them. This is actually a very tricky mathematical problem; there is no consensus on the best way to do it. The most commonly used program for aligning sequences is Clustal, which is available on the web as ClustalW. Go there and paste your set of protein sequences into the box labelled "Enter or paste a set of sequences in any supported format:". Change "Output order" to "Input". Leave the other choices at their defaults. Hit "Run". It may take a few minutes, but you'll eventually get results. Scroll down on the results page until you see the aligned sequences. They'll look like this:


Dmel            LLDGAHFMDNHFKTTPFEKNAPVILALLGVWYSNFFKAETHALLPYDQYLHRFAAYFQQG 60
bee             LLSGAHFMDQHFCTAPLEKNASILLALLGIWYHNFYKTETHALLPYDQYLHRFAAYFQQG 60
mosquito        LLDGAHYMDNHFLNAPLNENAPVILALMGIWYSNFYGAETHALLPYDQYLHRFAAYFQQG 60
human           LLSGAHWMDQHF--TPLEKNAPVLLALLGIWYINCFGCETHAMLPYDQYLHRFAAYFQQG 58
                **.***:**:**  :*:::**.::***:*:** * :  ****:*****************

Sites that are identical in all the species have an asterisk underneath them. Sites that are "similar" on a biochemical scale are indicated with a period or colon. Dashes are inserted by the Clustal program to make the sequences line up.
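
To see where the asterisks come from, here is a small Python sketch that marks the columns where every sequence has the same amino acid. It only reproduces the asterisks; the periods and colons depend on Clustal's own similarity groups, which are not reimplemented here. The function name identity_line is made up for this example, and the input is the first 12 columns of the sample alignment above.

```python
def identity_line(aligned):
    """Return '*' under each column where all sequences share one residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all be the same length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        # A column of gaps is not counted as identical.
        same = first != "-" and all(residue == first for residue in column)
        marks.append("*" if same else " ")
    return "".join(marks)

aligned = [
    "LLDGAHFMDNHF",  # Dmel
    "LLSGAHFMDQHF",  # bee
    "LLDGAHYMDNHF",  # mosquito
    "LLSGAHWMDQHF",  # human
]
print(identity_line(aligned))  # prints "** *** ** **"
```

The asterisks fall in the same columns as in the Clustal output; the three unmarked columns are the ones Clustal labels with a period or colon.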

Copy the aligned sequences into a Word or Notepad file. Make sure the font is a fixed-width font, such as Courier.

Now go back and run Clustal again. This time, change the "Output format" to Phylip. Now, the output will look like this:


     4    563
Dmel       LLDGAHFMDN HFKTTPFEKN APVILALLGV WYSNFFKAET HALLPYDQYL HRFAAYFQQG 
bee        LLSGAHFMDQ HFCTAPLEKN ASILLALLGI WYHNFYKTET HALLPYDQYL HRFAAYFQQG 
mosquito   LLDGAHYMDN HFLNAPLNEN APVILALMGI WYSNFYGAET HALLPYDQYL HRFAAYFQQG 
human      LLSGAHWMDQ HF--TPLEKN APVLLALLGI WYINCFGCET HAMLPYDQYL HRFAAYFQQG 

Copy this output into a separate file. Be sure to include the first line, with the two numbers on it; they're an important part of Phylip format.
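
The two numbers on the first line are the number of sequences and the number of aligned sites. If you want to check that your copied file is consistent with them, something like the following Python sketch would do it. It assumes the layout shown above (a name field of 10 characters, then sequence in blocks, with later blocks carrying no names); the function name read_phylip and the tiny example alignment are made up.

```python
def read_phylip(text):
    """Parse an interleaved Phylip alignment and check it against its header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split())
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_taxa:
            # First block: 10-character name field, then sequence.
            names.append(line[:10].strip())
            seqs.append("".join(line[10:].split()))
        else:
            # Later blocks: sequence only, in the same taxon order.
            seqs[i % n_taxa] += "".join(line.split())
    if len(names) != n_taxa:
        raise ValueError(f"header says {n_taxa} sequences, found {len(names)}")
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: header says {n_sites} sites, found {len(seq)}")
    return names, seqs

# Tiny made-up alignment: 2 sequences, 12 sites, in two blocks.
text = """     2     12
fly        ACDEFG
bee        ACD-FG

           HIKLMN
           HIKLMQ
"""
names, seqs = read_phylip(text)
print(names, seqs)
```

A ValueError here usually means the first line was left out or part of the alignment was not copied.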

Estimate the phylogeny with Phylip

Now you'll estimate the phylogeny of the species using a program called Phylip. There are a large number of techniques for estimating the phylogeny of a group of species. The systematists who do this for a living have violent arguments about which technique is best; they often use several different techniques and see if they give the same results.

We're going to ignore all the disagreements and just use one technique, parsimony. This technique finds the phylogenetic tree that requires the fewest substitutions to fit the observed data (I'll explain it more on Thursday). We'll use Phylip, a free collection of programs written by Joseph Felsenstein (I'll tell a story about me and him on Thursday) that implements many different phylogenetic techniques. While Phylip is very powerful, its command-line interface is rather confusing for beginners. Fortunately, there is a web server that runs the parsimony programs in Phylip. Go there and paste your set of Phylip-formatted sequences, including the first line with the two numbers on it, into the box labelled "Alignment file (protein alignment)". Make sure your outgroup sequence is the first one. Leave all the other options at their defaults. Then hit "run". The first box in the output will include a phylogenetic tree, like this:


One most parsimonious tree found:

     +--------------mosquito  
     !  
  +--3  +-----------tribolium 
  !  !  !  
  !  +--6     +-----louse     
  !     !  +--4  
  !     !  !  !  +--nasonia   
  1     +--2  +--5  
  !        !     +--bee       
  !        !  
  !        +--------Dmel      
  !  
  +-----------------human  

Copy this and save it to a file. If you get more than one equally parsimonious tree, save them all. We'll do more with the trees on Thursday.
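
To make "fewest substitutions" concrete, here is a Python sketch of the standard Fitch counting method, which gives the minimum number of changes needed at each site on a given tree. This is an illustration of the parsimony idea, not a copy of what Phylip does; Phylip's protein parsimony program may count changes differently. The function names are made up, the tree is written as nested pairs, and the data are the first 12 columns of the sample alignment shown earlier.

```python
def fitch_count(tree, residue_at_tip):
    """Return (possible_states, substitutions) for one site.

    tree is either a tip name (a string) or a (left, right) pair of subtrees.
    """
    if isinstance(tree, str):
        return {residue_at_tip[tree]}, 0
    left_states, left_cost = fitch_count(tree[0], residue_at_tip)
    right_states, right_cost = fitch_count(tree[1], residue_at_tip)
    shared = left_states & right_states
    if shared:
        # The two branches can agree, so no extra substitution is needed.
        return shared, left_cost + right_cost
    # The branches disagree: one substitution is required here.
    return left_states | right_states, left_cost + right_cost + 1

def tree_length(tree, aligned):
    """Sum the minimum number of substitutions over all sites."""
    n_sites = len(next(iter(aligned.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in aligned.items()}
        total += fitch_count(tree, site)[1]
    return total

aligned = {
    "Dmel":     "LLDGAHFMDNHF",
    "bee":      "LLSGAHFMDQHF",
    "mosquito": "LLDGAHYMDNHF",
    "human":    "LLSGAHWMDQHF",
}
tree_1 = (("Dmel", "mosquito"), ("bee", "human"))
tree_2 = (("Dmel", "bee"), ("mosquito", "human"))
print(tree_length(tree_1, aligned))  # prints 4
print(tree_length(tree_2, aligned))  # prints 6
```

For these 12 sites, the tree that groups the fly with the mosquito needs 4 substitutions and the alternative needs 6, so parsimony prefers the first. A parsimony program does this kind of count for many possible trees and reports the one (or ones) with the smallest total.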


This page was last revised September 21, 2009. Its URL is http://udel.edu/~mcdonald/geneticslab7.html