
After a long wait, my first paper is finally out. The title is “Population genomics of the wild yeast Saccharomyces paradoxus: Quantifying the life cycle”. It is open access and can be downloaded here.


Most microbes have complex life cycles with multiple modes of reproduction that differ in their effects on DNA sequence variation. Population genomic analyses can therefore be used to estimate the relative frequencies of these different modes in nature. The life cycle of the wild yeast Saccharomyces paradoxus is complex, including clonal reproduction, outcrossing, and two different modes of inbreeding. To quantify these different aspects we analyzed DNA sequence variation in the third chromosome among 20 isolates from two populations. Measures of mutational and recombinational diversity were used to make two independent estimates of the population size. In an obligately sexual population these values should be approximately equal. Instead there is a discrepancy of about three orders of magnitude between our two estimates of population size, indicating that S. paradoxus goes through a sexual cycle approximately once in every 1,000 asexual generations. Chromosome III also contains the mating type locus (MAT), which is the most outbred part in the entire genome, and by comparing recombinational diversity as a function of distance from MAT we estimate the frequency of matings to be ~ 94% from within the same tetrad, 5% with a clonemate after switching the mating type, and 1% outcrossed. Our study illustrates the utility of population genomic data in quantifying life cycles.


Tsai I.J, Bensasson D, Burt A, and Koufopanou V
Population genomics of the wild yeast Saccharomyces paradoxus: Quantifying the life cycle
PNAS 2008 : 0707314105v1-0.


A month ago Wong et al. [1] raised the issue of sequence alignment uncertainty and sparked wide interest. They studied 1502 sets of orthologous gene sequences from 7 yeast species and aligned each set with 7 widely used alignment programs. This produced 7 × 1502 = 10,514 alignments.

When phylogenies were estimated from these alignments, 46.2% of the 1502 sets yielded at least one tree that differed from the others (out of a possible 7). The inconsistency among the trees was caused by the different algorithms in the programs. The main approach to this kind of problem is to filter out the ‘ambiguous’ parts of the alignment, but this excludes too much of the primary data, and genuinely informative substitutions are likely to be removed as well. The problem is not limited to phylogenetic studies: tests of selection and estimates of population genetic parameters also depend on the alignment. Only ~9% of the ORFs in the sets that show a signature of positive selection (from the dN/dS ratio) are consistent across all 7 alignments, while the rest were sensitive to the alignment method.
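To make the tree-consistency count concrete, here is a minimal Python sketch of how such a tally could be done. It is not the procedure Wong et al. used: the gene set names and Newick strings are made-up toy data, and the sketch assumes each tree has already been written in a canonical form, so that two identical strings mean the same topology.

```python
# Toy data (hypothetical): for each gene set, the tree inferred from the
# alignment produced by each program. Trees are assumed to be canonical
# Newick strings, so string equality stands in for topological equality.
trees_by_set = {
    "set_1": ["((A,B),(C,D));", "((A,B),(C,D));", "((A,B),(C,D));"],
    "set_2": ["((A,B),(C,D));", "((A,C),(B,D));", "((A,B),(C,D));"],
    "set_3": ["((A,C),(B,D));", "((A,C),(B,D));", "((A,C),(B,D));"],
    "set_4": ["((A,B),(C,D));", "((A,C),(B,D));", "((A,D),(B,C));"],
}


def count_distinct_trees(trees):
    """Number of different topologies recovered for one gene set."""
    return len(set(trees))


def fraction_inconsistent(trees_by_set):
    """Fraction of gene sets for which the programs disagree on the tree."""
    inconsistent = sum(
        1 for trees in trees_by_set.values() if count_distinct_trees(trees) > 1
    )
    return inconsistent / len(trees_by_set)


if __name__ == "__main__":
    for name, trees in trees_by_set.items():
        print(name, count_distinct_trees(trees))
    print("fraction inconsistent:", fraction_inconsistent(trees_by_set))
```

With real data the hard part is the canonical form: the same unrooted topology can be written as many different Newick strings, so a proper tree comparison (for example a topological distance from a phylogenetics library) would be needed instead of string equality.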

Rokas [2] summarized the study, emphasizing that the problem is perhaps not that the programs are error-prone. The genes and the new sequences are simply harder to align (as opposed to the ones studied before, which were chosen because they were easy to align). Perhaps there is no single alignment, but rather a distribution of alignments that serves as a prior. Thirst for science has given a basic summary of the emerging statistical procedures for tackling alignment uncertainty, and Thomas Mailund has a bit more detail.

A few weeks later Margulies [3] published a commentary raising the same issue, this time based on observations in recent studies and a new study by Lunter et al. [4]. According to Lunter et al., more than 15% of the aligned bases in current human-mouse genome-wide alignments are incorrect. Attempts to improve the alignments by making indels more evolutionarily realistic have shown only modest improvement. It seems that alignment errors cannot be easily resolved, again reinforcing the need for a probabilistic formalism for multiple sequence alignment.

Putting the alignment uncertainty problem aside, there is another related, fundamental problem: sequencing errors. These errors will obviously affect the alignment. Logically it seems wrong to align possibly erroneous bases first and then mask them as missing data after the alignment is created. On the other hand, it would also be more complicated to mask the low-quality bases as missing data first and then align the sequences (a fifth base instead of four?). With the emergence of multiple-genome resequencing projects, base quality and alignment correctness need to be taken into account in all genomic studies.
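The "mask first, then align" option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the per-base Phred-style quality scores, and the threshold of 20 are hypothetical choices, not taken from any of the papers discussed here.

```python
def mask_low_quality(seq, quals, min_quality=20, mask_char="N"):
    """Replace bases whose quality score is below min_quality with mask_char.

    seq   -- the base calls as a string, e.g. "ACGT"
    quals -- one quality score per base (assumed to be Phred-style integers)
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    return "".join(
        base if q >= min_quality else mask_char
        for base, q in zip(seq, quals)
    )


if __name__ == "__main__":
    # Toy read with two low-quality positions (made-up values).
    print(mask_low_quality("ACGTAC", [35, 12, 40, 38, 8, 30]))
```

The masked sequence now effectively has a fifth symbol, N, and whether an alignment program treats it as missing data or as a mismatch depends entirely on the program and its scoring scheme.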

Other bloggers’ comments here, and here.


1. Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532

2. Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156

3. Margulies, E.H. (2008). Confidence in Comparative Genomics. Genome Research, 18(2), 199-200. DOI: 10.1101/gr.7228008

4. Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., Hein, J. (2008). Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Research, 18(2), 298-309. DOI: 10.1101/gr.6725608
