It was 1st December 2008 when I first started working at the Sanger Institute. I have now been here for a month (with 2 weeks’ holiday!) and I am still feeling very happy. I have no complaints so far (other than the dreadful-looking photo on my ID card), and I would like to contribute some of my personal thoughts about Sanger on my dying blog.

Note: Everything below is solely my personal opinion.

Things I like about Sanger

  1. It’s truly an amazing place filled with truly amazing people.
    People here are really friendly and willing to help you during breaks. Everyone is really^inf clever and seems to know everything. There are so many interesting seminars and projects going on all the time. It’s a truly fascinating place if you like to meet people, do research, learn biology, play with computers and have fun.
  2. You have a lot of data for your research
    It’s every scientist’s dream to have all the data in the world to extract information from and learn something about life itself. On my first day of work I got to assemble a few million reads from a helminth genome.
  3. Resources
    Pretty self-explanatory: people + computing power + data
  4. Available projects
    If you have time, there are so many possibilities to interact with different groups within the Institute. Some people I know are working on 10+ projects and they thrive on it.
  5. Think big
    I admit that my main interest is genome evolution, so it’s really cool to see all the latest technologies and methods being used to tackle many questions on a massive scale. I don’t know if I can go back and sequence just the one locus and study it for the rest of my life…
  6. Perl
    enough said..

There are also some things I have noticed..

  1. High staff turnover
    I am not sure if it’s the fate of all major science institutes. People who come in (assuming they are reasonably clever and work hard) usually get a position elsewhere within 3 years, and they leave unfinished projects behind. Usually the solution is to leave the project as it is or to advertise a new position to fill the gap. But it takes time and energy (e.g., reading other people’s directories/reports/code without much instruction) to get the project started again.
  2. You have too much data for your research
    Having a lot of data is good. But too much data requires careful planning of how to handle and analyse data at such a scale. It’s also problematic when there is no consensus on how to tackle a simple problem. This results in lots of scripts all doing essentially the same thing.
  3. Perl with no documentation
    enough said..
  4. Think big Think small?
    I am from a biology background. It kind of bugs me when there are so many pipelines/scripts lying around without much in the way of instructions. I understand that with so much data and so many deadlines, documenting your Perl (!) code is probably the last thing you want to do. But it would probably be good in the long term to start doing it.
  5. Too many meetings
    During my PhD, I could just sit down and code for the whole day, focused, and get things done! At Sanger, because each genome project is of interest to many groups, you tend to have a lot of meetings. If even I am feeling this way, imagine what the PIs are feeling…

To conclude, I love it so far, and let’s hope my brain and hairline will cope with a place like Sanger…

P.S. Sanger is really short of staff at the moment. They are looking for ‘normal’ people who like doing research 😛


With 3 months left until my PhD stipend runs out, I have been thinking about the next stage of my career. Looking at the job ads, the market is a bit different now. It’s good to assess my situation and see what the market has to offer.

A lot of my friends who also enrolled in the PhD bioinformatics program have found their next career stage, either staying in the same lab, already working in IB/consultancy, or even going to retire (!) in S. Africa.

I’d like to stay in research. I love research. And I’d love to find a job around Cambridge, where my girlfriend is starting a PhD this Oct. If possible, I’d like to get involved in 1000 genomes or large sequencing projects. It’s an exciting time for (re)sequencing.

What I have (or think I have) that may look good

  • a distinction in MSc Bioinformatics
  • potentially 4 first-author papers (does it matter? I have no idea) by the end of my PhD
  • used Java extensively in the early days, but as data got messier and larger, I have been using Perl only.
  • good statistical knowledge
  • involved in various genome sequencing projects

What the market* wants (that would be really nice)

  • database design
  • software engineering experience
  • C++ or Java. (To be fair I don’t see much point in this..)
  • experience in Web applications (RoR..etc)
  • have used Bio: modules (BioPerl, BioPython) extensively
  • knowledge in server deployment

Most of my PhD involves sets of 10MB (~multiple copies of yeast genomes) of data that can be read from flat files. 10MB! I thought it was overkill to construct a database for it. Now I regret that a little, as database management would be a big big plus for what people are looking for.

Most ads mention that you need one ‘scripting language’ like Perl and another language like C++ or Java. Personally I am not so sure about this. The vast amount of data generated in the life sciences means any language would be too slow. Rather, I would concentrate on Perl (with a slight hope that Perl 6 will take over the world), and learn Python (or maybe Ruby) thoroughly.

Another thing I have missed out on is using the Bio: modules, for example BioPerl. BioPerl is a very powerful package, but most of what it does is pretty much ‘standard’. In the end I just write everything I want from scratch (for those of you who have used DNAsp before, I have rewritten almost everything in Perl..).

One good thing about bioinformatics jobs is that they are usually very specific. Database? Python? Web application? You name it. And it’s not difficult to learn and prepare. You just have to keep practicing. Reading blogs and friendfeed will give you some idea of what skills/topics people are interested in.

So, to prepare for jobs after my PhD, I think I will contribute to some of the Bio: modules. I will learn Python thoroughly, brush up on some of my statistical techniques, and keep throwing my CV at people.

And of course, finish writing up my thesis first.

*From jobs.ac.uk, Sanger/EBI jobs, evoldir. This sample may not be representative.

Like fellow student Michael, I am also going to SMBE in Barcelona this year. I will be presenting a poster (sigh, when will I ever be able to present a talk? 😛 ).

I have always loved going to conferences, perhaps even more than anyone. Why? Being in a small family business (boss no 1, boss no 2 aka boss no 1’s wife, and me), this is one of the few chances I get to meet people and shout out my ideas. In my 2.5 years of PhD I have gone to 1 winter school, 5 workshops and 2 conferences. I have met some amazing people, though they may not remember me at all. As long as I remember them and have learnt lots of new ideas, that’s fine!

I would like to share the things I have learnt from going to these conferences, to help anyone make full use of them. First I’d like to say that I am an extremely introverted person, so obviously meeting people can be a bit tricky for me. Some of the points below may be obvious to some, but they are all my personal experiences.

  1. Prepare for the conference
    Not just your poster/talk (this is more like a must). Know who you want to meet. As a PhD student, I would like to meet fellow PhD students who are struggling to write up and are considering the next career stage. Or you have read someone’s paper and would like to speak to them about it. Or you encountered some problem implementing someone’s model. Or even a potential collaboration. A conference is surprisingly short in terms of meeting people, so be prepared.
  2. Know how to introduce yourself, at the right place and right time
    Now you know who you would like to meet. Find the person and wait for the chance to introduce yourself and ask the question. Be patient. If you want to ask the big guns you will need to join the queue. Trust me, it will be worth it (e.g., having him explain it to you saves much more time than reading his paper another twenty times). And be brief when introducing yourself, and jump straight into the question (like “Hi! My name is Jason, and I have a question with regard to xxx”).

    Real case 1:
    I remember that last year a student asked a professor a question in the toilet. It didn’t turn out well..

    Real case 2:
    At my first conference I needed so badly to ask a professor from Oxford a question that I stalked him throughout the coffee break. He was always with someone, so it would have been rude to interrupt them. In the end I gave up, but being the kind person he is (or maybe he was a bit freaked out by me), he actually came over to me and asked what I wanted. With his help I was able to use his work to publish my first paper.
  3. Choose the question carefully; don’t ask open-ended questions
    A conference is sort of like speed dating. You find a person who doesn’t have a lot of time for you, ask a question, and get an answer. So don’t ask questions like “what’s the meaning of life?” If you ask an interesting question and it clicks, he will be more likely to ask you questions back! And don’t ask questions like “Can I be a post doc in your lab?”; leave that for after the conference.
  4. Don’t be let down by the big results
    This probably doesn’t apply to everyone. I get disappointed by many things, like doing a poster instead of a talk. And sometimes I envy what people have come up with at conferences. These results are always amazing, with an aura radiating around them. And then you start to blame yourself, “%£$%&$%£%$3…”. Well, don’t. One thing I realised is that you can’t do everything (although as a bioinformatician you are constantly tempted), so learn to appreciate these results, and perhaps adapt their theories into your own research.
  5. No big lunch/No overdrinking
    This also gets mentioned in various articles. I know it’s hard.. but I absolutely agree, given I am a big eater. If you eat too much, you won’t have the concentration to sit through the rest of the afternoon (no matter how much coffee you drink). I have had some embarrassing experiences…
  6. Speak strictly to what you know
    Don’t bullshit or bluff or comment on things you have absolutely no idea about. And don’t try to steer every conversation towards something you already know. You will get your turn. This is merely a personal reflection, as I have had conversations in which, whatever I said, the other person would say “ah, this is interesting, but what I did in species x was…”, rather than openly discussing possible ideas.
  7. Take notes and follow up straight after
    Don’t wait. Tidy your notes. Start new analyses straight away if you can. Start sending emails. Otherwise you will forget everything in 2 weeks and there would have been no point in spending the money on the conference.
  8. Be yourself
    Whether you are an introvert/extrovert, geek/non-geek, fashionable/dull… be yourself. Don’t try so hard to please others. Again, people can tell, and you would make them uneasy. Everyone’s interesting in their own way, so just be yourself. If you don’t like being with the people in the group you have joined, go to another one. Or rather, go back and tidy your notes. Make yourself useful if you think you can’t cope with some people.
  9. Enjoy!
    You get to visit a new city. You are meeting people whose papers/textbooks you have read throughout your research. You meet your own peers who are also struggling to work/write theses/papers. People are friendly (overfriendly I would say) and criticise you with no hard feelings. Sometimes (only sometimes) you even get a compliment from someone saying your research is interesting, and that would make it two! What’s not to enjoy?

This probably doesn’t apply to everyone, especially if you already have someone in your group to go with. I am glad that this year I actually know quite a few people (yes, even the professor I stalked will be at this conference). So enjoy it when you are at a conference!

After a long wait my first paper is finally out. The title is “Population genomics of the wild yeast Saccharomyces paradoxus: Quantifying the life cycle“. It is open access and can be downloaded here.


Most microbes have complex life cycles with multiple modes of reproduction that differ in their effects on DNA sequence variation. Population genomic analyses can therefore be used to estimate the relative frequencies of these different modes in nature. The life cycle of the wild yeast Saccharomyces paradoxus is complex, including clonal reproduction, outcrossing, and two different modes of inbreeding. To quantify these different aspects we analyzed DNA sequence variation in the third chromosome among 20 isolates from two populations. Measures of mutational and recombinational diversity were used to make two independent estimates of the population size. In an obligately sexual population these values should be approximately equal. Instead there is a discrepancy of about three orders of magnitude between our two estimates of population size, indicating that S. paradoxus goes through a sexual cycle approximately once in every 1,000 asexual generations. Chromosome III also contains the mating type locus (MAT), which is the most outbred part in the entire genome, and by comparing recombinational diversity as a function of distance from MAT we estimate the frequency of matings to be ~ 94% from within the same tetrad, 5% with a clonemate after switching the mating type, and 1% outcrossed. Our study illustrates the utility of population genomic data in quantifying life cycles.


Tsai I.J, Bensasson D, Burt A, and Koufopanou V
Population genomics of the wild yeast Saccharomyces paradoxus: Quantifying the life cycle
PNAS 2008 : 0707314105v1-0.
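
The core argument of the abstract, comparing two independent estimates of population size, can be sketched with a few lines of arithmetic. This is only a minimal illustration and not the analysis from the paper: the relation `diversity = scale * N * rate`, the `scale` factor of 2, the function names and every input number are assumptions chosen so that the ratio comes out near 1 in 1,000.

```python
import math


def population_size(diversity, rate_per_generation, scale=2.0):
    """Solve diversity = scale * N * rate for N (an assumed, simplified relation)."""
    if rate_per_generation <= 0 or scale <= 0:
        raise ValueError("rate and scale must be positive")
    return diversity / (scale * rate_per_generation)


def sexual_generation_fraction(n_mutational, n_recombinational):
    """Fraction of generations that are sexual, taken as the ratio of the two estimates."""
    return n_recombinational / n_mutational


# Made-up inputs, NOT values from the paper.
theta = 2e-3  # mutational diversity per site (hypothetical)
mu = 1e-10    # mutation rate per site per generation (hypothetical)
rho = 6e-3    # recombinational diversity per site (hypothetical)
r = 3e-7      # recombination rate per site per sexual generation (hypothetical)

n_mut = population_size(theta, mu)  # counts every generation
n_rec = population_size(rho, r)     # only "sees" the sexual generations
fraction = sexual_generation_fraction(n_mut, n_rec)
print(f"roughly one sexual cycle per {1 / fraction:,.0f} generations")
```

In an obligately sexual population the two estimates would be about equal and the fraction would be close to 1. A gap of three orders of magnitude gives a fraction of about 0.001.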

A month ago Wong et al. [1] brought up the issue of sequence alignment uncertainty and sparked wide interest. They studied 1502 sets of orthologous gene sequences from 7 yeast species, and aligned each set with 7 of the most commonly used alignment programs. This produced 7*1502 alignments.

When phylogenies were estimated from these alignments, 46.2% of the 1502 sets yielded one or more differing trees (out of a possible 7). The inconsistency between trees was caused by the different algorithms in the programs. The main approach to this kind of problem is to filter out the ‘ambiguous’ parts, but this excludes too much of the primary data, and genuinely informative substitutions are likely to be removed too. It is not just phylogenetic studies: tests of selection and estimates of population genetic parameters also depend on the alignment. Only ~9% of the ORFs showing a signature of positive selection (from the dn/ds ratio) were consistent across all 7 alignments, while the rest were sensitive to the alignment method.
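
To make the bookkeeping concrete, here is a small sketch of how one might count gene sets whose aligners disagree on the tree. It is not the pipeline of Wong et al.: the function name, the toy gene names and the toy topologies are all hypothetical, and it assumes each tree has already been reduced to a canonical string so that identical topologies compare equal.

```python
def fraction_with_conflicting_trees(trees_by_gene):
    """Fraction of gene sets where the aligners yield more than one distinct topology.

    trees_by_gene maps a gene-set name to a list of tree topologies, one per
    alignment program, each written as a canonical string (an assumption).
    """
    if not trees_by_gene:
        raise ValueError("need at least one gene set")
    conflicting = sum(
        1 for trees in trees_by_gene.values() if len(set(trees)) > 1
    )
    return conflicting / len(trees_by_gene)


# Toy data: 4 hypothetical gene sets, 3 hypothetical aligners each.
toy = {
    "geneA": ["((a,b),c)", "((a,b),c)", "((a,b),c)"],
    "geneB": ["((a,b),c)", "((a,c),b)", "((a,b),c)"],
    "geneC": ["((a,c),b)", "((a,c),b)", "((a,c),b)"],
    "geneD": ["((a,b),c)", "((b,c),a)", "((a,c),b)"],
}
toy_fraction = fraction_with_conflicting_trees(toy)

# Scale of the real study, using only the numbers quoted above.
n_alignments = 7 * 1502
n_conflicting_sets = round(0.462 * 1502)
print(toy_fraction, n_alignments, n_conflicting_sets)
```

With the quoted numbers, that is 10514 alignments in total and about 694 gene sets with at least one disagreeing tree.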

Rokas [2] summarised the study, emphasizing that perhaps it’s not that the programs are prone to errors. The genes and the new sequences are just harder to align (as opposed to the ones studied before, which were studied because they were easy to align). Perhaps there is no single alignment, but rather a distribution of alignments that serves as a prior. Thirst for science has given a basic summary of the emerging statistical procedures to tackle alignment uncertainty, and Thomas Mailund has a bit more detail.

A few weeks later Margulies [3] published a commentary raising the same issue again, this time based on observations from recent studies and a new study from Lunter et al. [4]. According to Lunter et al., more than 15% of aligned bases in current human-mouse genome-wide alignments are incorrect. Attempts to improve the alignment by making indels more evolutionarily realistic have shown only modest improvement. It seems that alignment errors cannot be easily resolved, again reinforcing the need for a probabilistic formalism for multiple sequence alignments.

Putting the alignment uncertainty problem aside, there is another related fundamental problem: sequencing errors. These will obviously affect the alignment. Logically it seems wrong to align possibly erroneous bases first and then mask them as missing data after the alignment is created. Nevertheless, it would also be more complicated to mask the low-quality bases as missing data first and then align the sequences (a fifth base instead of four?). With the emergence of multiple-genome resequencing projects, base quality and alignment correctness need to be incorporated into all genomic studies.
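
The "mask first, then align" option could look something like the sketch below. It is only an illustration under stated assumptions: the function name, the quality cutoff of 20 and the use of `N` as the "fifth base" are my choices for the example, not part of any of the cited studies, and the aligner that would consume the masked sequence is left out entirely.

```python
def mask_low_quality(sequence, qualities, min_quality=20, missing="N"):
    """Replace every base whose quality score is below min_quality with `missing`.

    The cutoff of 20 is an assumed, illustrative threshold.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    return "".join(
        base if quality >= min_quality else missing
        for base, quality in zip(sequence, qualities)
    )


# Hypothetical read with per-base quality scores.
masked = mask_low_quality("ACGTAC", [35, 12, 40, 8, 22, 20])
print(masked)
```

The masked sequence keeps its original length, so positions still line up with the raw read, but whatever aligns it then has to treat `N` sensibly rather than as a mismatch.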

Other bloggers’ comments here, and here.


1. Wong, K.M., Suchard, M.A., Huelsenbeck, J.P. (2008). Alignment Uncertainty and Genomic Analysis. Science, 319(5862), 473-476. DOI: 10.1126/science.1151532

2. Rokas, A. (2008). GENOMICS: Lining Up to Avoid Bias. Science, 319(5862), 416-417. DOI: 10.1126/science.1153156

3. Margulies, E.H. (2008). Confidence in Comparative Genomics. Genome Res., 18(2), 199-200. DOI: 10.1101/gr.7228008

4. Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., Hein, J. (2008). Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Res., 18(2), 298-309. DOI: 10.1101/gr.6725608

I finally had a chance to reread the autobiography by Emanuel Derman, his personal reflections on being a physicist who turned into a quant. It’s one of the few books out there in which a scientist writes about the struggle between your interests and the reality of a job that actually pays. The book consists of three parts: his brilliance in physics and his failure to compete, his difficult transition from physics into finance, and his views on the finance world.

I was 33 years old and halfway through my postdoc; where was this peregrination going to end?…. On the day in 1978 I suddenly found myself flirting with the idea of going to medical school… Physics is a harsh meritocracy. Most of the merit is concentrated in a small number of legendary figures…. if you aren’t Feynman, you’re no one. A competent, but not brilliant, research physicist had little to feel good about; who needs what you provide?

This is exactly what most scientists feel when faced with the harsh reality of making a living: you love research, but you end up going through the eternal cycle of postdocs if you aren’t good/novel enough. He then describes his feelings, both shame and pride, about working as a quant in the finance world. Though a true physicist at heart, his roles ranged from building applications and supporting traders to tinkering with models for different financial products. The book also describes the history of how the Black-Scholes model emerged, for those interested in how models are used in the finance world. The writing style is casual and informal, which makes it very easy to read.

The audience for this book is really people who want a peek into finance. What interested me was how he sees models in the two worlds. On financial theory, he quotes Fischer Black:

a theory is accepted not because it is confirmed by conventional empirical tests, but because researchers persuade one another that the theory is correct and relevant

Sounds like dn/ds model? :p It is almost like bioinformatics, a clash between biology and mathematics…
What interests me the most is his struggle in academia and his difficult transition to being a quant. I am finishing my PhD and still in my mid-twenties, yet I already have the same feeling he writes about in the final chapter of his book


Being a scientist can sometimes be depressing. Surrounded by younger versions of yourself, you are constantly confronted by the mismatch between the dreams of youth and the facts of maturity

Don’t we all?

From Phdcomics

Every so often, when I am asked what exactly I am doing in my PhD, I think for quite some time as if I had no idea. I have started to notice this behaviour becoming increasingly common among my peers as well when they are asked the same question. This is not because we don’t know what we are doing, but rather because we do not know how to characterise ourselves.

I am currently enrolled as a PhD student in bioinformatics, and this is what I do in a typical day:

What I do (grant version)

Understanding the evolutionary forces that shape genomic variation between and within different yeast species. A day of work would involve categorising and analysing polymorphism/divergence and estimating parameters that explain the effectiveness of different evolutionary forces acting at the DNA level.

What I really do:

  1. I code and make tables. I have had a DNA sequence alignment since 2005. I have been trying to extract everything from the alignment for 2 years and counting. Whatever you can think of doing with a DNA alignment, I have done it all (yes, been there, done that). This either involves making the alignment into a SNP dataset, feeding it into some program someone has published and hoping the results come out fine. Or the program no longer works/does not quite suit/is no longer maintained, and you write one yourself. (Current achievement: writing almost every function of DNAsp with missing data in Perl) In the end you have a big big big table. (time spent: 5 minutes ~ days)
  2. I make graphs. big big big table -> human readable graphs. Depending on how much money your boss has, greyscaling the graph can be a real…. (time spent: 5 minutes ~ hours if publishing deadline approaching )
  3. I read papers. This will take ages. Papers with good results often mean you have to dig the real methods out of the supplementary materials. Read the papers as if you were the reviewer. (time spent: depending on the level of kindness of the authors)
  4. I interpret the results. This is what you work hard for. (time spent: 1 minute)
  5. I re-run everything I have done in the last 2 years, in 30 minutes. Results are fascinating, and you have to confirm them. Sure! You re-run everything you have done before, and realise your PhD is, in effect, 30 minutes of work if someone has published a good program to calculate everything (time spent: 30 minutes and days questioning yourself)
  6. I write. Results are fine and confirmed. Now I need to make sure everyone (aka supervisor) understands them. (time spent: depends)
  7. I moan. A problem people may notice from the list above is that I have not done anything novel. The results are novel, but all the methods are old. Is that a PhD? I do know for a fact that I am not qualified for most jobs currently advertised, which say something like the desired candidates would need to design novel statistical tests. (time spent: forever)
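
As a rough illustration of step 1 (alignment -> SNP dataset, with missing data), here is a minimal sketch. It is written in Python rather than Perl and is not the code referred to above. The function name, the set of characters treated as missing (`N`, `-`, `?`) and the toy alignment are all assumptions for the example.

```python
def snp_columns(alignment, missing=frozenset("N-?")):
    """Return (position, column) pairs for alignment columns that are polymorphic.

    A column counts as a SNP when it has more than one distinct base after
    ignoring characters treated as missing data (an assumed convention).
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("need a non-empty alignment of equal-length sequences")
    snps = []
    for position, column in enumerate(zip(*alignment)):
        called = {base.upper() for base in column} - missing
        if len(called) > 1:
            snps.append((position, "".join(column)))
    return snps


# Toy alignment of three hypothetical isolates.
toy_alignment = [
    "ACGTA",
    "ACNTA",
    "ATGTC",
]
toy_snps = snp_columns(toy_alignment)
for position, column in toy_snps:
    print(position, column)
```

Column 2 of the toy alignment contains an `N`, but the remaining bases agree, so it is not reported. Positions are 0-based.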

(tea breaks/supervisor meetings/seminars/daydreaming/blog reading not included)

This is my typical day of work, and I would call myself a genomic data analyst with experience of dealing with large genome datasets and good statistical knowledge and very quick coding hands and a very fundamental understanding of biochemistry/molecular biology and who knows a bit of everything (note: this is not a self promotion post). This is what we do: we are biologists who have a lot of ‘and’ in our roles.

And personally I have the integrity to make sure my paper is not just another genome analysis paper that publishes summary statistics and mentions future directions. I make sure other scientists can repeat my analysis in 30 minutes with my alignment or any well-established alignment and see the significance of it.

So what exactly do you do?