I've written a program, SliderPrep, that will take aligned DNA sequences and create the input file for DNA Slider. It creates a list of polymorphisms and fixed differences, classified as synonymous, non-coding, or amino acid replacement variation.
Download the appropriate version and put it in the same folder as your data files.
Create a file with your aligned sequences in FASTA or PIR format. The nucleotides can be in upper or lower case, and you may use ambiguity codes. The outgroup sequence must be the first one in the file.
Then, on the first line of the file (above the sequences), put two numbers: the number of exons, a space, then the "codon start" of the coding sequence. The codon start is 1, 2, or 3; it is the position in the coding sequence of the first base of the first complete codon. If you are using sequences from GenBank, you can get this information from the GenBank entry for the outgroup sequence. For example, the GenBank entry for accession number AY864503 contains these features:
CDS             join(<1..15,85..290,350..650,974..>1938)
                /gene="Hum"
                /codon_start=2
If it were your outgroup, the first line of your data file would read "4 2". There are four exons (1-15, 85-290, 350-650 and 974-1938), and the first complete codon starts at position 2.
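If you would rather not count the exons by hand, the following Python sketch pulls the same numbers out of a CDS location string. It is not part of SliderPrep; the function name `header_from_cds` is made up for this illustration, and it assumes a simple `join(...)` of `start..end` ranges on the forward strand (no `complement(...)`, no single-base exons).

```python
import re

def header_from_cds(location, codon_start):
    """Return (header_line, exon_lines) for a simple GenBank CDS location.

    Hypothetical helper: handles only forward-strand start..end ranges.
    """
    # Each exon is written start..end; the < and > markers only flag a
    # partial sequence, so they are skipped rather than captured.
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    exons = [(int(start), int(end)) for start, end in pairs]
    header = f"{len(exons)} {codon_start}"
    exon_lines = [f"{start} {end}" for start, end in exons]
    return header, exon_lines

header, exon_lines = header_from_cds(
    "join(<1..15,85..290,350..650,974..>1938)", 2)
print(header)            # 4 2
for line in exon_lines:  # 1 15, 85 290, 350 650, 974 1938
    print(line)
```

Check the result against the GenBank entry before using it, since more complicated location strings will not be parsed correctly by this pattern.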
Next, add one line for each exon to your data file. The line should contain the starting and ending base for the exon, as seen in the outgroup sequence (ignoring gaps). For the above example, the top of the data file would look like this:
4 2
1 15
85 290
350 650
974 1938
>sim_hum
CTTCAAGGACTATATGTAAGTGCCGCCCGTTCCAATTGACGTAATTCCCGTGCTAATCCTTGGAGAAATC
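To make "ignoring gaps" concrete: exon positions count only the bases of the outgroup sequence, not the gap characters added by the alignment. The short Python sketch below shows the relationship between a gap-free position and an alignment column. The function name is hypothetical, the toy sequence is invented for illustration, and it assumes gaps are written as "-".

```python
def ungapped_to_column(aligned_seq, position, gap="-"):
    """Return the 1-based alignment column of a 1-based gap-free position."""
    count = 0
    for column, base in enumerate(aligned_seq, start=1):
        if base != gap:
            count += 1  # only real bases advance the position counter
            if count == position:
                return column
    raise ValueError("position is beyond the end of the sequence")

aligned = "CTT--CAAGG"  # toy outgroup sequence with a two-base gap
print(ungapped_to_column(aligned, 3))  # 3
print(ungapped_to_column(aligned, 4))  # 6
```

In this toy case the fourth base of the outgroup sits in column 6 of the alignment, but the number that belongs in the data file is still 4.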
Look at the sample data set for another example.
If you are analyzing non-coding DNA, just put one line with "0 0" at the beginning of the file.
Save this file in plain text format, with the .txt extension. Put the file in the same folder as the SliderPrep program.
Now you are ready to run SliderPrep. It will ask you whether you want to ignore ambiguously aligned bases adjacent to gaps. If you say yes, the program will move outward from each gap in the alignment until it finds N sites that are identical in all of the sequences, where you tell it what N should be (I use 4). I find that when I look at sequences, gaps are often bordered by bases whose alignment isn't that obvious; I could easily imagine moving the gap back and forth a little and not making the alignment noticeably worse. I prefer to ignore these ambiguous sites.
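The gap-masking idea can be sketched in Python as follows. This is one possible reading of the rule, not SliderPrep's actual code: it assumes the N identical sites must be consecutive, that gaps are written as "-", and that a short run of identical sites is ignored only if a variable site lies farther out. The function name and the toy alignment are invented for illustration.

```python
def columns_to_ignore(seqs, n=4, gap="-"):
    """Return sorted 0-based columns treated as ambiguously aligned."""
    length = len(seqs[0])
    has_gap = [any(s[i] == gap for s in seqs) for i in range(length)]
    # A column is "same" if it has no gap and one base in every sequence.
    same = [not has_gap[i] and len({s[i].upper() for s in seqs}) == 1
            for i in range(length)]
    ignored = set()
    for i in range(length):
        if not has_gap[i]:
            continue
        for step in (-1, 1):  # walk left, then right, from the gap column
            streak = []       # consecutive identical columns seen so far
            j = i + step
            while 0 <= j < length and not has_gap[j]:
                if same[j]:
                    streak.append(j)
                    if len(streak) == n:
                        break  # found n identical sites in a row: stop
                else:
                    # A variable site this close to the gap is ignored,
                    # along with the identical sites between it and the gap.
                    ignored.update(streak)
                    ignored.add(j)
                    streak = []
                j += step
    return sorted(ignored)

toy = ["AAGTC--ACGTA",
       "AAGTCTTATGTA"]
print(columns_to_ignore(toy, n=2))  # [7, 8]
```

With N set to 2, the variable site just to the right of the gap and the identical site between it and the gap are both ignored, while everything to the left of the gap is kept.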
The program will ask you the name of the input file (it must have a ".txt" extension; you don't need to type the ".txt"). It will then ask for the names of two output files, one for the aligned sequences and one for the list of varying sites.
After you run the program, look at the output file with the aligned sequences. If you've chosen to ignore ambiguously aligned sites, they will be shown in lower case. The coding regions have a space between the codons, while a "|" marks exon boundaries. The amino acid for the outgroup is shown beneath each codon. Look at this file and make sure the alignment looks good.
Next, go through the list of varying sites. Write down the information in the first line of the varying site list (the length of the region and the number of sequences in the ingroup); you'll need to enter this when you run DNA Slider. The program lists all of the polymorphisms and fixed differences, and it tries to figure out whether they're synonymous, amino acid replacement, or non-coding. If there are any ambiguities (Y, M, S, etc.), it will say "ambiguity" and you'll have to edit it.
If there are multiple variations in a single codon, the file will say "multiple," followed by the codon in the outgroup, then all of the different codons in the ingroup. You'll then need to look at the codons and edit the file. You may find a table of the genetic code helpful. For example, the analyzed sample data set includes this, two varying sites in the same codon:
273 P multiple CTA(L) || CTA(L) CTG(L) GTG(V)
275 P multiple CTA(L) || CTA(L) CTG(L) GTG(V)
Site 273 is polymorphic in the ingroup, and so is site 275. Site 273 is non-synonymous (changing CTG to GTG changes the amino acid from L to V), while site 275 is synonymous (CTA and CTG both code for L). So you should change the above lines to this:
273 P replacement CTA(L) || CTA(L) CTG(L) GTG(V)
275 P synonymous CTA(L) || CTA(L) CTG(L) GTG(V)
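The reasoning used for this codon can be sketched in Python as a cross-check on hand editing. This is a simple heuristic, not what SliderPrep does: it assumes the standard genetic code, compares only pairs of observed codons that differ at the one position in question, and reports "unresolved" when no such pair exists or when the pairs disagree. The names `CODE` and `classify_position` are made up for this illustration.

```python
from itertools import combinations, product

# Standard genetic code, built in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_position(codons, pos):
    """Classify variation at codon position pos (0, 1, or 2)."""
    verdicts = set()
    for a, b in combinations(sorted(set(codons)), 2):
        differing = [i for i in range(3) if a[i] != b[i]]
        if differing == [pos]:  # the pair differs at this position only
            same_aa = CODE[a] == CODE[b]
            verdicts.add("synonymous" if same_aa else "replacement")
    if len(verdicts) == 1:
        return verdicts.pop()
    return "unresolved"  # no informative pair, or conflicting pairs

codons = ["CTA", "CTG", "GTG"]       # outgroup and ingroup codons
print(classify_position(codons, 0))  # site 273: replacement
print(classify_position(codons, 2))  # site 275: synonymous
```

Anything the sketch reports as "unresolved" still needs to be worked out by eye with a genetic code table.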
You may see a "split codon." This is where there is variation in a codon that has one base on one side of an intron, and the other two bases on the other side. Variation in a split codon is unusual enough that I've been lazy and haven't had the program figure out whether it's replacement or synonymous; if you have one of these, you'll have to look at the sequence file and figure it out yourself.
Once you've figured out whether each site is replacement, synonymous, or non-coding, you need to decide whether to run DNA Slider on all of the sites, or just the silent sites (non-coding and synonymous). I recommend using just the silent sites, so you can see whether selection on the replacement variation may be causing heterogeneity in the ratio of polymorphism to divergence at the linked silent sites. You can run DNA Slider on the data set including the replacement variation, but if you get a significant result, the list of possible interpretations will be longer.
If you've decided to just analyze the silent sites, go through the file and delete every line with a replacement. Don't worry about changing the number at the top of the file that says the total number of varying sites; the newer version of DNA Slider doesn't actually use that number any more.
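If the list is long, the deletion can be done with a few lines of Python instead of by hand. This sketch assumes the layout shown in the example above, where the classification is the third whitespace-separated field on each line; the function name is hypothetical, and the header line passes through untouched because its third field, if it has one, is not the word "replacement".

```python
def drop_replacements(lines):
    """Return the lines whose third field is not 'replacement'."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "replacement":
            continue  # skip amino acid replacement sites
        kept.append(line)
    return kept

edited = [
    "273 P replacement CTA(L) || CTA(L) CTG(L) GTG(V)",
    "275 P synonymous CTA(L) || CTA(L) CTG(L) GTG(V)",
]
for line in drop_replacements(edited):
    print(line)  # only the line for site 275 is printed
```

Run it only after you have resolved every "multiple," "ambiguity," and split-codon line, since those are not recognized as replacements and would be kept.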
After you're done editing the varying sites list, save it as a plain text file (".txt" extension), and you're ready to use DNA Slider.
Last Updated: May 11, 2006.
URL of this document: http://udel.edu/~mcdonald/sliderprep.html