A month ago Wong et al. [1] raised the issue of sequence alignment uncertainty and sparked wide interest. They studied 1502 sets of orthologous gene sequences from 7 yeast species and aligned each set with 7 widely used alignment programs, producing 7 × 1502 = 10,514 alignments.
When phylogenies were estimated from these alignments, 46.2% of the 1502 sets yielded at least one differing tree (out of a possible 7), i.e. the 7 alignments did not all give the same tree. The inconsistency among the trees comes from the different algorithms used by the programs. The main approach to this kind of problem is to filter out the ‘ambiguous’ parts of the alignment, but this excludes too much of the primary data, and genuinely informative substitutions are likely to be removed as well. It is not just phylogenetic studies: other analyses, such as tests of selection or estimates of population genetics parameters, also depend on the alignment. Only ~9% of the ORFs in these sets that show a signature of positive selection (from the dN/dS ratio) do so consistently in all 7 alignments; the rest are sensitive to the alignment method.
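To make the tree-consistency count concrete, here is a minimal Python sketch. It is not the procedure used by Wong et al.; it only illustrates the bookkeeping. It assumes each inferred tree has already been reduced to a set of clades (a hypothetical representation chosen so that two trees with the same topology compare equal), and the gene names, taxon labels and the helper names `topology` and `fraction_inconsistent` are all made up for illustration.

```python
from typing import Dict, FrozenSet, List

# Assumption: a tree topology is represented as a set of clades,
# each clade being the set of taxon labels it contains.
Clade = FrozenSet[str]
Topology = FrozenSet[Clade]


def topology(*clades) -> Topology:
    """Build a hashable topology from any number of taxon sets."""
    return frozenset(frozenset(c) for c in clades)


def fraction_inconsistent(trees_by_gene: Dict[str, List[Topology]]) -> float:
    """Fraction of gene sets whose alignments gave more than one distinct tree."""
    inconsistent = sum(
        1 for trees in trees_by_gene.values() if len(set(trees)) > 1
    )
    return inconsistent / len(trees_by_gene)


# Toy data: two alternative topologies over hypothetical taxon labels.
t_a = topology({"Scer", "Spar"}, {"Scer", "Spar", "Smik"})
t_b = topology({"Scer", "Smik"}, {"Scer", "Spar", "Smik"})

# Each gene set has 7 trees, one per alignment program.
toy = {
    "gene1": [t_a] * 7,               # all 7 alignments agree
    "gene2": [t_a] * 5 + [t_b] * 2,   # two programs disagree
    "gene3": [t_b] * 7,               # all agree (on a different tree)
    "gene4": [t_a, t_b] + [t_a] * 5,  # one program disagrees
}

result = fraction_inconsistent(toy)
print(f"{result:.1%} of gene sets gave at least one differing tree")
```

On real data the hard part is upstream of this: running the aligners, inferring the trees, and reducing each tree to a comparable form. The count itself is just a check for more than one distinct topology per gene set.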
Rokas [2] summarized the study, emphasizing that perhaps the programs themselves are not especially error-prone. The genes and the newly sequenced genomes may simply be harder to align (as opposed to the ones studied before, which were studied because they were easy to align). Perhaps there is no single correct alignment, but rather a distribution of alignments that can serve as a prior. Thirst for science has given a basic summary of the emerging statistical procedures for tackling alignment uncertainty, and Thomas Mailund has a bit more detail.
A few weeks later Margulies [3] published a commentary raising the same issue, this time drawing on observations from recent studies and on a new study by Lunter et al. [4]. According to Lunter et al., more than 15% of the aligned bases in current human-mouse genome-wide alignments are incorrect. Attempts to improve the alignments by making indels more evolutionarily realistic have shown only modest improvement. It seems that alignment errors cannot be easily resolved, which again reinforces the need for a probabilistic formalism for multiple sequence alignment.
Putting the alignment uncertainty problem aside, there is another related and fundamental problem: sequencing errors. These errors will obviously affect the alignment. Logically it seems wrong to align sequences that may contain erroneous bases first and only then mask those bases as missing data once the alignment has been built. On the other hand, it would also be more complicated to mask the low-quality bases as missing data first and then align the sequences (a fifth base instead of four?). With the emergence of multi-genome resequencing projects, base quality and alignment correctness need to be taken into account in all genomic studies.
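The "mask first, then align" option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: per-base qualities are taken to be Phred-style integers, the cutoff of 20 is an arbitrary choice rather than a recommendation from any of the papers above, and `mask_low_quality` is a hypothetical helper name. Masked positions become `N`, which is exactly the "fifth base" the aligner would then have to cope with.

```python
from typing import List


def mask_low_quality(seq: str, quals: List[int],
                     min_q: int = 20, mask_char: str = "N") -> str:
    """Replace every base whose quality is below min_q with mask_char.

    Assumes one Phred-style integer quality per base; min_q=20 is an
    illustrative threshold, not a value taken from the cited studies.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    return "".join(
        base if q >= min_q else mask_char
        for base, q in zip(seq, quals)
    )


# Toy read: the second and fourth bases fall below the cutoff.
masked = mask_low_quality("ACGTAC", [30, 12, 40, 8, 25, 20])
print(masked)  # ANGNAC
```

Whether an alignment program treats `N` sensibly (as missing data rather than as a mismatch) depends on the program and its scoring scheme, which is part of why this ordering is not as simple as it looks.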
Other bloggers’ comments are here, and here.
_____________________________________________________________________
1. Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532
2. Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156
3. Margulies, E.H. (2008). Confidence in Comparative Genomics. Genome Res., 18(2), 199-200. DOI: 10.1101/gr.7228008
4. Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., Hein, J. (2008). Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Res., 18(2), 298-309. DOI: 10.1101/gr.6725608