4.1 Multiple Sequence Alignment

A multiple sequence alignment is an alignment of more than 2 sequences. It turns out that this makes the problem of alignment much more complicated, and much more computationally expensive. Dynamic programming algorithm such as Smith-Waterman can be extended to higher dimensions, but at a significant computing cost. Therefore, numerous methods have been developed to make this task faster.

4.1.1 MSA Methods

There have been numerous methods developed for computing MSAs do make them more computationally feasible.

Dynamic Programming

Despite the computational cost of MSA by Dynamic Programming, there have been approaches to compute multiple sequence alignments using these approaches [24,25]. The programs MSA [25] and MULTALIN [26] use dynamic programming. This process takes $O(L^N)$ computations for aligning $N$ sequences of length $L$. The Carrillo-Lipman Algorithm uses pairwise alignments to constrain the search space. By only considering regions of the multi-sequence alignment space that are within a score threshold for each pair of sequences, the $L^N$ search space can be reduced.

Progressive alignments

Progressive alignments begin by performing pairwise alignments, by aligning each pair of sequences. Then it combines each pair and integrates them into a multiple sequence alignment. The different methods differ in their strategy to combine them into an overall multiple sequence alignment. Most of these methods are “greedy”, in that they combine the most similar pairs first, and proceed by fitting the less similar pairs into the MSA. Some programs that could be considered progressive alignment include T-Coffee [27], ClustalW and its variants [28], and PSAlign [29].

Iterative Alignment

Iterative Alignment is another approach that improves upon the progressive alignment because it starts with a progressive alignment and then iterates to incrementally improve the alignment with each iteration. Some programs that could be considered iterative alignment include CHAOS/DIALIGN [30], and MUSCLE [31].

Full Genome Alignments

Specialized multiple sequence alignment approaches have been developed for aligning complete genomes, to overcome the challenges associated with aligning such long sequences. Some programs that align full genomes include MLAGAN (using LAGAN) [32], MULTIZ (using BLASTZ) [33], LASTZ [34], MUSCLE [31], and MUMmer [35, 36].

4.1.2 MSA File Formats

There are several file formats that are specifically designed for multiple sequence alignment. These approaches can differ in their readability by a human, or are designed for storing large sequence alignments.

Multi-FASTA Format

Probably the simplest multiple sequence alignment format is the Multi-FASTA format (MFA), which is essentially like a FASTA file, such that each sequence provides the alignment sequence (with gaps) for a given species. The deflines can in some cases only contain information about the species, and the file name, for example, could contain information about what sequence is being described by the file. For short sequences the mfa can be human readable, but for very long sequences it can become difficult to read. Here is an example $\texttt{.mfa}$ file that shows the alignment of a small (28aa) Drosophila melanogaster peptide called Sarcolamban (isoform C) with its best hits to $\texttt{nr}$.

D.melanogaster
--------------------------------------------------
--------------------------------MSEARNLFTTFGILAILL
FFLYLIYA---------------------VL---------------
>D.sechellia
--------------------------------------------------
--------------------------------MSEARNLFTTFGILAILL
FFLYLIYAPAAKSESIKMNEAKSLFTTFLILAFLLFLLYAFYEAAF
>D.pseudoobscura
MSEAKNLMTTFGILAFLLFCLYLIYASNNSKRWPTFCGEAEFRSENSESQ
LLRAFSYERLEQCPNKKYPPKQPTTTTTKPIKMNEARSLFTTFLILAFLL
FLLYAFYEA--------------------AF---------------
>D.busckii
--------------------------------------------------
--------------------------------MNEAKSLVTTFLILAFLL
FLLYAFYEA--------------------AF---------------

Clustal

The Clustal format was developed for the program $\texttt{clustal}$ [37], but has been widely used by many other programs [28, 38]. This file format is intended to be fairly human readable in that it expresses only a fixed length of the alignment in each section, or block. Here is what the Clustal format looks like for the same Sarcolamban example:

CLUSTAL W (1.83) multiple sequence alignment

D.melanogaster    ------------------------------------------------------------
D.sechellia       ------------------------------------------------------------
D.pseudoobscura   MSEAKNLMTTFGILAFLLFCLYLIYASNNSKRWPTFCGEAEFRSENSESQLLRAFSYERL
D.busckii         ------------------------------------------------------------

D.melanogaster    ----------------------MSEARNLFTTFGILAILLFFLYLIYA------------
D.sechellia       ----------------------MSEARNLFTTFGILAILLFFLYLIYAPAAKSESIKMNE
D.pseudoobscura   EQCPNKKYPPKQPTTTTTKPIKMNEARSLFTTFLILAFLLFLLYAFYEA-----------
D.busckii         ----------------------MNEAKSLVTTFLILAFLLFLLYAFYEA-----------
*.**:.*.*** ***:***:** :*

D.melanogaster    ---------VL---------------
D.sechellia       AKSLFTTFLILAFLLFLLYAFYEAAF
D.pseudoobscura   ---------AF---------------
D.busckii         ---------AF---------------
:

The clustal format has the following requirements, which can make it difficult to create one manually. First, the first line in the file must start with the words “$\texttt{CLUSTAL W}$” or “$\texttt{CLUSTALW}$“. Other information in the first line is ignored, but can contain information about the version of $\texttt{CLUSTAL W}$ that was used to create it. Next, there must be one or more empty lines before the actual sequence data begins. The rest of the file consists of one or more blocks of sequence data. Each block consists of one line for each sequence in the alignment. Each line consists of the sequence name, defline, or identifier, some amount white space, then up to 60 sequence symbols such as characters or gaps. Optionally, the line can be followed by white space followed by a cumulative count of residues for the sequences. The amount of white space between the identifier and the sequences is usually chosen so that the sequence data is aligned within the sequence block. After the sequence lines, there can be a line showing the degree of conservation for the columns of the alignment in this block. Finally, all this can be followed by some amount of empty lines.

MAF

The Multiple Alignment Format (MAF) can be a useful format for storing multiple sequence alignment information. It is often used to store full-genome alignments at the UCSC Genome Bioinformatics site. The file begins with a header beginning with $\texttt{##maf}$ and information about the version and scoring system. The rest of the file consists of alignment blocks. Alignment blocks start with a line that begins with the letter $\texttt{a}$ and a score for the alignment block. Each subsequent line begins with either an $\texttt{s}$, a $\texttt{i}$, or an $\texttt{e}$ indicating what kind of line it is. The lines beginning with $\texttt{s}$ contain sequence information. Lines that begin with $\texttt{i}$ typically follow each $\texttt{s}$-line, and contain information about what is occurring before and after the sequences in this alignment block for the species considered in the line. Lines beginning with $\texttt{e}$ contain information about empty parts of the alignment block, for species that do not have sequences aligning to this block. For example, the following is a portion of the alignment of the Human Genome (GRCh38/hg38) $\texttt{chr22}$ with 99 vertebrates.

##maf version=1 scoring=roast.v3.3
a score=49441.000000
s hg38.chr22 10514742 28 + 50818468 acagaatggattattggaacagaataga
s panTro4.chrUn_GL393523 96163 28 + 405060 agacaatggattagtggaacagaagaga
i panTro4.chrUn_GL393523 C 0 C 0
s ponAbe2.chrUn 66608224 28 - 72422247 aaagaatggattagtggaacagaataga
i ponAbe2.chrUn C 0 C 0
s nomLeu3.chr6 67506008 28 - 121039945 acagaatagattagtggaacagaataga
i nomLeu3.chr6 C 0 C 0
s rheMac3.chr7 24251349 14 + 170124641 --------------tggaacagaataga
i rheMac3.chr7 C 0 C 0
s macFas5.chr7 24018429 14 + 171882078 --------------tggaacagaataga
i macFas5.chr7 C 0 C 0
s chlSab2.chr26 21952261 14 - 58131712 --------------tggaacagaataga
i chlSab2.chr26 C 0 C 0
s calJac3.chr10 24187336 28 + 132174527 acagaatagaccagtggatcagaataga
i calJac3.chr10 C 0 C 0
s saiBol1.JH378136 10582894 28 - 21366645 acataatagactagtggatcagaataga
i saiBol1.JH378136 C 0 C 0
s eptFus1.JH977629 13032669 12 + 23049436 ----------------gaacaaagcaga
i eptFus1.JH977629 C 0 C 0
e odoRosDiv1.KB229735 169922 2861 + 556676 I
e felCat8.chrB3 91175386 3552 - 148068395 I
e otoGar3.GL873530 132194 0 + 36342412 C
e speTri2.JH393281 9424515 97 + 41493964 I
e myoLuc2.GL429790 1333875 0 - 11218282 C
e myoDav1.KB110799 133834 0 + 1195772 C
e pteAle1.KB031042 11269154 1770 - 35143243 I
e musFur1.GL896926 13230044 2877 + 15480060 I
e canFam3.chr30 13413941 3281 + 40214260 I
e cerSim1.JH767728 28819459 183 + 61284144 I
e equCab2.chr1 43185635 316 - 185838109 I
e orcOrc1.KB316861 20719851 245 - 22150888 I
e camFer1.KB017752 865624 507 + 1978457 I

The $\texttt{s}$ lines contain 5 fields after the $\texttt{s}$ at the beginning of the line. First, the source of the column usually consists of a genome assembly version, and chromosome name separated by a dot “.”. Next is the start position of the sequence in that assembly/chromosome. This is followed by the size of the sequence from the species, which may of course vary from species to species. The next field is a strand, with “+” or “-“, indicating what strand from the species’ chromosome the sequence was taken from. The next field is the size of the source, which is typically the length of the chromosome in basepairs from which the sequence was extracted. Lastly, the sequence itself is included in the alignment block.

4.2 Phylogenetic Trees

A phylogenetic tree is a representation of the evolutionary history of a character or sequence. Branching points on the tree typically represent gene duplication events or speciation events. We try to infer the evolutionary history of a sequence by computing an optimal phylogenetic tree that is consistent with the extant sequences or species that we observe.

4.2.1 Representing a Phylogenetic Tree

A phylogenetic tree is often a “binary tree” where each branch point goes from one to two branches. The junction points where the branching takes place are called “internal nodes”. One way of representing a tree is with nested parentheses corresponding to branching. Consider the following example

((A,B),(C,D));

where the two characters $\texttt{A}$ and $\texttt{B}$ are grouped together, and the characters $\texttt{C}$ and $\texttt{D}$ are grouped. The semi-colon at the end is needed to make this tree in proper “newick” tree format. One of the fastest ways to draw a tree on the command line is the “ASCII tree”, which we can draw by using the function $\texttt{Phylo.draw_ascii()}$. To use this, we’ll need to save our tree to a text file that we can read in. We could read the tree as just text into the python terminal (creating a string), but that would require loading an additional module $\texttt{cStringIO}$ to use the function $\texttt{StringIO}$. Therefore, it might be just as easy to save it to a text file called $\texttt{tree.txt}$ that we can read in. Putting this together, we can draw the tree with the following commands:

>>> from Bio import Phylo
>>> Phylo.draw_ascii(tree)
___________________________________ A
___________________________________|
|                                   |___________________________________ B
_|
|                                    ___________________________________ C
|___________________________________|
|___________________________________ D

This particular tree has all the characters at the same level, and does not include any distance or “branch length” information. Using real biological sequences, we can compute the distances along each brach to get a more informative tree. For example, we can download 18S rRNA sequences from NCBI Nucleotide. Using clustalw, we can compute a multiple sequence alignment, and produce a phylogenetic tree. In this case, the command

4.4.2 Create a Multiple Sequence Alignment and Phylogenetic Tree with Clustal

First, use $\texttt{clustalw2}$ to align the sequences, and output a multiple sequence alignment and dendrogram file.

clustalw2 -infile=18S_rRNAs.fasta -type=DNA -outfile=18S_rRNAs.aln The dendogram file, indicated by the “$\texttt{.dnd}$” suffix, can be used to create an image of a phylogenetic tree using Biopython: First, enter the python terminal by typing “$\texttt{.python}$” and then the following: >>> import matplotlib as mpl >>> mpl.use('Agg') >>> import matplotlib.pyplot as pyplot >>> from Bio import Phylo >>> tree = Phylo.read('18S_rRNAs.dnd','newick') >>> Phylo.draw(tree) >>> pyplot.savefig("myTree.png") >>> quit() How does the resulting tree compare to what you expect given these species? 4.4.3 Create a Multiple Sequence Alignment and Phylogenetic Tree with phyML The program phyML provides much more flexibility in what sort of trees it can compute. To use it, we’ll need to convert our alignment file to phylip. We can do this with the Biopython module AlignIO >>> from Bio import AlignIO >>> alignment = AlignIO.parse("18S_rRNAs.aln","clustal") >>> AlignIO.write(alignment,open("18S_rRNAs.phy","w"),"phylip") 1 >>> quit() You can run phyml in the simplest way by simply typing “phyml” and then typing your alignment file: phyml

Enter the sequence file name > 18S_rRNAs.phy

The program will give you a set of options, and you can optionally change them. To change the model, type “+” and then “M” to toggle through the models. Finally, hit “enter” to run the program.

>>> import matplotlib as mpl
>>> mpl.use('Agg')
>>> import matplotlib.pyplot as pyplot
>>> from Bio import Phylo
How does the tree created from $\texttt{clustalw2}$ compare to the tree created using $\texttt{phyml}$?