
NGS Leaders Blog

Assembly Required: Lessons in De Novo Genome Assembly


December 14, 2011

Kevin Davies: Last week’s NGS Leaders webinar on De Novo Assembly of Complex Plant and Animal Genomes prompted more than 300 scientists and informaticians to pre-register, which speaks to the ubiquity and challenge of assembling complex genomes using short reads. Our three speakers, Mario Caccamo (The Genome Analysis Centre, UK), Ian Korf (UC Davis) and Jeffrey Rosenfeld (UMDNJ), outlined several key pointers in tackling complex genome assembly.

“Every genome has its own story in terms of repeats,” says Ian Korf, associate director of bioinformatics at the University of California Davis Genome Center. Korf is one of the principal organizers of the Assemblathon, a competition to identify best practices in the de novo assembly of complex plant and animal genomes.

Results of the first phase of the Assemblathon were recently published in Genome Research. You can also read more about the Assemblathon at Bio-IT World.

“Every genome is a complex genome – even the simpler ones are pretty complex. There’s no easy genome,” said Korf.

The Assemblathon participants, 17 groups in all, were challenged to assemble a synthetic chromosome of some 96 million bases. Commenting on the results, Korf said: “A lot did a pretty good job, but it’s more difficult to assemble regions with more mutations, so the coding regions were assembled better than non-coding regions.”

The assemblies were ranked by various criteria, including contig and scaffold paths, structural and copy number errors. The top five entrants emerged as:

 - Broad Institute (ALLPATHS-LG)
 - BGI (SOAPdenovo)
 - Wellcome Trust Sanger Institute (SGA)
 - DOE Joint Genome Institute (Meraculous)
 - Cold Spring Harbor Lab (Quake, Celera, Bambus2)

Several useful tools emerged, says Korf, but experience in using the tools makes a big difference. “We found that sometimes two groups will use the same assembler, but the group that knows a bit more about the assembler might do a slightly better job. It’s something of an art at this point,” said Korf.

Choose Wisely 

Korf says that wisely choosing the many different parameters involved in de novo genome assembly is difficult and “probably shouldn’t be attempted by amateurs.” He advises inexperienced users to “contact one of the major sequencing centers and get them to help you. Doing it on your own is pretty much guaranteed to give you a sub-optimal assembly… Don’t jump into genome assembly thinking it’s just like any other bioinformatics problem you can hack with some Perl scripts.”

It starts as far upstream as DNA library preparation. “You don’t want to choose the assembler as the last thing you do,” says Korf. “It must be in conjunction with the sequencing technology, how are the libraries made, the full equation. You can’t do it stepwise… So much is dependent on having high quality sequence and making your libraries correctly.”

Another wise move, says Korf, is to perform a pilot project to explore the content and gauge the overall repeat content of the genome in question. “You should do a little homework ahead of time to get an idea of GC content and other factors,” says Korf.
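As a rough illustration of the kind of homework Korf describes, the sketch below estimates GC content from a handful of pilot reads. It is a minimal example, not a method presented in the webinar: the function names and the sample reads are hypothetical, and it assumes the reads are already available as plain strings of bases.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C.

    Ambiguous calls such as N are ignored. An input with no
    unambiguous bases returns 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def pooled_gc_content(reads) -> float:
    """GC fraction over all reads pooled together (hypothetical helper)."""
    return gc_content("".join(reads))


# Hypothetical pilot reads, for illustration only.
pilot_reads = ["ATGCGC", "ATATNN", "ggccat"]
print(f"Pooled GC content: {pooled_gc_content(pilot_reads):.2f}")
```

A real pilot would read sequences from FASTQ files and look at the distribution of GC content across reads as well as the pooled value, since strongly skewed GC content is one of the factors that can influence library and sequencing choices.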

The availability of longer read lengths, such as those produced by the Pacific Biosciences platform, should prove a boost for genome assemblies. “The long reads are fantastic, but the error rate is a bit of an issue,” said Mario Caccamo, head of bioinformatics at TGAC and a fellow co-author of the Assemblathon I report.

But Korf says the PacBio reads can prove very useful when integrated with short-read data: “Genome assembly with longer reads will get much, much easier. The game will be completely changed with reads on the order of 10 kilobases.”

Korf believes the NGS community (“super smart people, full of competitive spirit”) will figure out how to use these 3rd-generation technologies. “Right now, they haven’t had enough time to figure out how to put it all together, but they will pretty soon,” he says. “What you’ll get three years from now will be a lot better than today.”

Clearly, the audience response suggests we should revisit this topic in the near future. What NGS-related topics would you like to see presented in a free NGS Leaders webinar? Email Janine with suggestions.
