Substitutional asymmetry programs


These programs test protein sequence data for substitutional asymmetry using the technique described in McDonald et al. (1999). Under the neutral model of protein evolution, a comparison of sequences from two species should show equal numbers of substitutions in each direction for each pair of amino acids. For example, when comparing the mesophile (moderate habitat temperature) Bacillus subtilis with the thermophile (high habitat temperature) Geobacillus stearothermophilus, you would expect the number of sites with alanine in B. subtilis and proline in G. stearothermophilus to equal the number of sites with proline in B. subtilis and alanine in G. stearothermophilus. In fact, a data set of 229 proteins showed 122 aligned sites with the first pattern and only 69 sites with the second pattern, suggesting that proline is preferred by natural selection over alanine in G. stearothermophilus (McDonald et al. 1999).

For some pairs of amino acids, substitutional asymmetry might be explained by differences in G+C content. For example, there are 412 lysine-->arginine substitutions between B. subtilis and G. stearothermophilus and only 157 arginine-->lysine substitutions. This may be explained by the higher G+C content of the G. stearothermophilus genome and the higher G+C content of arginine codons (AGR or CGN) over lysine (AAR). For inferring general patterns about which amino acids are favored in different environments, it is therefore preferable to compare species with similar genomic G+C contents, such as mesophilic Methanococcus spp. with the thermophile Methanocaldococus jannaschii (Haney et al. 1999, McDonald et al. 1999) or the moderate temperature (but radiation resistant!) Deinococcus radiodurans with the thermophile Thermus thermus (McDonald 2001).

Substitutional asymmetry is worth looking for in any comparison of related species that differ in some factor of the environment that might affect all proteins. Most examples of substitutional asymmetry come from comparisons of mesophiles and thermophiles (Haney et al. 1999, McDonald et al. 1999, McDonald 2001, Nishio et al. 2003), but substitutional asymmetry has also been found between the moderate-pressure Pyrococcus furiosus and the barophile P. abyssi (Di Giulio 2005).

The programs available here require a set of aligned homologous pairs of protein sequences from two species. The AmbiguityRemover program takes the file of sequences and puts ambiguously aligned sites next to alignment gaps in lower case, so they can be ignored by the AsymmetryCounter program. AsymmetryCounter counts the number of substitutions in each direction for each pair of amino acids, and compares the ratio to the expected 1:1 ratio using a test of goodness-of-fit. AsymmetryScaler finds an asymmetry index that illustrates how preferred or unpreferred each amino acid is in one species.

The programs were written for Windows and MacOS X using the Free Pascal compiler; if you need them compiled for MacOS 7, 8, or 9, e-mail me. Free Pascal is available for several other operating systems, and because the programs use a simple console interface, it should be easy to compile them for other operating systems (or using other Pascal compilers, I imagine).

Download the asymmetry programs

Bug fix alert: The original version of AsymmetryCounter did not handle the uppercase letter 'X' in protein sequences correctly. This was only a problem if you did not use AmbiguityRemover to put ambiguously aligned sites in lowercase, so they would be ignored by AsymmetryCounter. This bug was fixed in a new version made available on Feb. 15, 2006.

Here are the Windows versions:

And here is a disk image for Mac OS X with all three programs:

You may also want to download an example data set, 206 pairs of aligned sequences from Deinococcus radiodurans and Thermus thermophilus. This is the data set that was analyzed in McDonald (2001).

If you want to see the source code, here are the files:

Creating the input file

The first step is to create a file containing pairs of aligned protein sequences. The sequences must be in Fasta format with single-letter amino acid symbols as capital letters. The file must be saved as plain text (with a '.txt' extension) and must be placed in the same folder as the asymmetry programs. Each pair of sequences must be in the same order, like this:

>Gene1_Sp.1
MEGYLVLEDGTSFSGELDG--HENCTGEAVFFTGMTGYQEVLTDPSYKGQIIVFTYPLIG
>Gene1_Sp.2
MKAYLHVASGKTFSGELAAPLEEKVSGEIVFFTGMTGYQEVLTDPSYKNQIIVFTYPLIG
>Gene2_Sp.1
MFMKSTGIVRKVDELGRVVIPIELRRTLGIAEKDALEIYVDDEKIILKKYKPNMTCQVTG
>Gene2_Sp.2
----MTGIVRKVDELGRVVIPIELRRTLDINEKDALEIYVDDDRIILKKYKPNMTCVVTG
           .
           .
           .

AmbiguityRemover

The second step is to run the AmbiguityRemover program. This finds each gap in the alignment and walks in either direction from the gap until it finds N adjacent sites that match in the two sequences, where N can be set from 1 to 10. Everything between the matching adjacent sites and the gap is considered "ambiguously aligned" and is put into lower case letters so the AsymmetryCounter program can ignore it. Here is what the above data look like after running them through the program with N set equal to 2 (which I find gives good-looking alignments):

>Gene1_Sp.1
MEGYLVLEDGTSFSGELdg--henctGEAVFFTGMTGYQEVLTDPSYKGQIIVFTYPLIG
>Gene1_Sp.2
MKAYLHVASGKTFSGELaapleekvsGEIVFFTGMTGYQEVLTDPSYKNQIIVFTYPLIG
>Gene2_Sp.1
mfmksTGIVRKVDELGRVVIPIELRRTLGIAEKDALEIYVDDEKIILKKYKPNMTCQVTG
>Gene2_Sp.2
----mTGIVRKVDELGRVVIPIELRRTLDINEKDALEIYVDDDRIILKKYKPNMTCVVTG

AmbiguityRemover also produces a file of summary information about each gene pair. You should scan through this to see if any sequences have a length of 0; if they do, it probably is because the two sequences are different lengths in the input file. You can also use the information in the summary file if you want to delete sequences that are shorter or more divergent than some pre-established criteria.

AsymmetryCounter

AsymmetryCounter takes the file of aligned sequences produced by AmbiguityRemover and counts, for each pair of amino acids, the number of substitutions in each direction. It compares the ratio of substitutions in one direction and substitutions in the opposite direction to the expected 1:1 ratio using a goodness-of-fit test. If a pair of amino acids has less than 50 total substitutions, the exact binomial test is used and the P-value is reported; if there are 50 or more substitutions, the G-test of goodness-of-fit is used and the G-value and P-value are reported. If the G-value is greater than 454, the P-value is given as "<1E-100."


           sp. 1-->sp. 2            G-value         P
--------------------------------   --------     ---------
A-->A     3169    A-->A     3169     0.0000     1.0000000
A-->C       14    C-->A       44    16.1567     0.0000583***
A-->D       52    D-->A      108    19.9588   7.9131E-006***
A-->E      357    E-->A      164    73.1571   1.1973E-017***
A-->F       42    F-->A       20     7.9153     0.0049019**
A-->G      191    G-->A      292    21.2548   4.0215E-006***
A-->H       39    H-->A       23     4.1427     0.0418148*
A-->I       35    I-->A       52     3.3242     0.0682687
                    .
                    .
                    .
A-->V      224    V-->A      266     3.6007     0.0577537
A-->W        8    W-->A       12       ----     0.5034447

Above are the some lines of output from AsymmetryCounter when run on the example data set. The first line shows that there are 3169 sites with alanine in both species. The second line shows that there are 14 sites with alanine in species 1 and cysteine in species 2, and 44 sites with the opposite pattern. Because there are more than 50 substitutions, it is tested with the G-test, giving a G-value of 16.1567 and a P-value of 0.0000583. The third line has a P-value of 7.9131 x 10-6. The last line shows a pair of amino acids (alanine and tryptophan) with fewer than 50 substitutions. This is tested with the exact binomial test, giving a P-value of 0.5034447. P-values less than 0.05 get one star, less than 0.01 get two stars, and less than 0.001 get three stars.

AsymmetryCounter also produces a file with a 20x20 substitution matrix, for use by AsymmetryScaler.

AsymmetryScaler

AsymmetryScaler takes the substitution matrix produced by AsymmetryCounter and produces an asymmetry index, in which the amino acids are arranged from least preferred in species 2 to most preferred in species 2. The expected ratio of substitutions in one direction (QAB, the proportion of sites with amino acid A in species 1 and amino acid B in species 2) to substitutions in the opposite direction (QBA) is related to the difference in asymmetry index values using the equation:

ln(QAB/QBA)=AIB-AIA

(This equation was shown incorrectly in McDonald et al. (1999), but the results there were based on the correct version shown here.) The scale is found that minimizes the weighted differences between the observed and expected ratios.

AsymmetryScaler minimizes the difference between observed and predicted ratios by first assigning each amino acid a random asymmetry index, then moving them, one at a time, to find the position that minimizes the difference. This is repeated for each amino acid, then starting with the new asymmetry index values, the amino acids are again slid around to reduce the total deviation. This is repeated until the total deviation (the D-value) stops getting smaller.

I worry about getting trapped in local minima, although I don't know whether that's mathematically possible for this system, so you can do replicate runs in AsymmetryScaler. Each replicate starts with a different set of random asymmetry index values, and each slides the amino acids around in a different order. The replicate with the lowest final D-value is used. If you choose three decimal places of precision, the program takes a very long time to run. I suggest running it with one decimal place to make sure everything is set up correctly, then run it overnight with three decimal places of precision and 100 or so replicates. You may need to turn off screen savers or energy savers so the program will keep running overnight.


AA   Asymmetry Index
--   ---------------
 N          -0.95330
 D          -0.76330
 S          -0.67030
 C          -0.66830
 T          -0.54330
 Q          -0.37230
 G          -0.35430
 M          -0.26830
 I          -0.15230
 A          -0.04630
 H           0.09270
 V           0.13770
 K           0.38670
 L           0.52070
 E           0.53670
 F           0.54070
 W           0.57970
 R           0.59270
 Y           0.66770
 P           0.73670

The output of AsymmetryScaler, as shown above for the example data set, is a list of the 20 amino acids with their asymmetry indices (one alphabetical list, then one list in order of asymmetry index). Note that the difference between asymmetry indices is what predicts the substitution ratio, so the midpoint of the scale is arbitrary. In McDonald et al. (1999) and McDonald (2001), I set the midpoint equal to 1. The program now sets the midpoint equal to 0, which I now think makes more sense.


Sp.1-->Sp.2      N        p      E(p)
-----------   ------   ------   ------
   N-->D         117   0.5556   0.5474
   N-->S          90   0.6222   0.5703
   N-->C           3   0.3333   0.5708
   N-->T          70   0.5714   0.6011
   N-->Q          46   0.6087   0.6413
   N-->G          86   0.7209   0.6454

AsymmetryScale also outputs the list of pairs of amino acids along with the substitutional proportion predicted from the asymmetry index. The first line indicates that 0.5556 of the 117 substitutions involving asparagine (N) and aspartate (D) have asparagine in species 1 and aspartate in species 2. The proportion predicted from the difference in the asymmetry indices is 0.5474. Plotting the correlation between the expected and the observed asymmetry proportions could give you an idea of how well the asymmetry index is summarizing the data. You'll probably want to omit those pairs with small numbers of substitutions, to reduce the noise.

References

Di Giulio, M. 2005. A comparison of proteins from Pyrococcus furiosus and Pyrococcus abyssi: barophily in the physicochemical properties of amino acids and in the genetic code. Gene 346: 1-6.

Haney, P.J., J.H. Badger, G.L. Buldak, C.I. Reich, C.R. Woese, and G.J. Olsen. 1999. Thermal adaptation analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species. Proc. Natl. Acad. Sci. 96:3578-3583.

McDonald, J.H., A.M. Grasso, and L.K. Rejto. 1999. Patterns of temperature adaptation in proteins from Methanococcus and Bacillus. Mol. Biol. Evol. 16: 1785-1790.

McDonald, J.H. 2001. Patterns of temperature adaptation in proteins from the bacteria Deinococcus radiodurans and Thermus thermophilus. Mol. Biol. Evol. 18: 741-749.

Nishio, Y., Y. Nakamura, Y. Kawarabayasi, Y. Usuda, E. Kimura, S. Sugimoto, K. Matsui, A. Yamagishi, H. Kikuchi, K. Ikeo, and T. Gojobori. 2003. Comparative complete genome sequence analysis of the amino acid replacements responsible for the thermostability of Corynebacterium efficiens. Genome Res. 13: 1572-1579.


Go back to John McDonald's home page

Send comments to mcdonald@udel.edu

Last Updated: February 25, 2006.
URL of this document: http://udel.edu/~mcdonald/asymmetry.html