How much are we missing by just looking at SNPs and hg19
Jeffrey Rosenfeld
Posted: Friday, January 27, 2012 8:43 AM
Joined: 10/30/2011
Posts: 4


I would like to begin a discussion concerning two distinct but related topics in reference to genome variant detection.

1. How much variation is there in genomes that we are not seeing because we are strictly looking at variants relative to a reference?  Are there significant amounts of non-reference sequence in genomes that we are missing, or are there rearrangements that are so dramatic they are difficult to determine by just using a reference?

2. How much of the variation in genomes lies in variants more complex than SNPs that have not been extensively profiled, such as MNPs (multiple nucleotide polymorphisms), overlapping out-of-phase indels and SNPs, or multiple nearby SNPs affecting the same protein?  Are there cases where a gene will have 5 SNPs that need to be investigated together rather than just running SIFT or Polyphen on each of them separately?
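
To make the multi-SNP question concrete, here is a minimal Python sketch. It is not taken from SIFT, Polyphen or any other tool; the reference codon and both SNPs are invented for illustration, and it assumes the two SNPs sit on the same haplotype. It compares the amino acid predicted when each SNP is evaluated on its own with the one predicted when they are applied together.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_snps(codon, snps):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


def translate(codon):
    """Translate one codon; '*' denotes a stop codon."""
    return CODON_TABLE[codon]


# Hypothetical example: reference codon CAA (Gln) with two SNPs in it.
ref = "CAA"
snp_a = (0, "T")  # C>T at the first codon position
snp_b = (2, "T")  # A>T at the third codon position

# Each SNP evaluated on its own, as a per-variant annotator would do.
separate = [translate(apply_snps(ref, [s])) for s in (snp_a, snp_b)]
# Both SNPs applied together (assumes they are on the same haplotype).
joint = translate(apply_snps(ref, [snp_a, snp_b]))

print(translate(ref), separate, joint)  # Q ['*', 'H'] Y
```

Evaluated separately, one SNP looks like a stop-gain (TAA) and the other like Gln>His (CAT); applied together they give Gln>Tyr (TAT) and no stop at all. Whether this matters in a real gene depends on phasing, which this sketch simply assumes.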


Simona Federica Maria Gaudi
Posted: Tuesday, January 31, 2012 3:21 AM
Joined: 9/12/2011
Posts: 1


I'm wondering how many LINEs, HERVs and Alu are correctly described in the reference sequence (hg19). 

I personally think that all this information is important in deciphering the genomics of complex diseases. 

Simona

simona.gaudi@iss.it


Zamin Iqbal
Posted: Monday, February 6, 2012 5:10 AM
Joined: 1/13/2012
Posts: 5


Hi Jeff

A very recent paper in Nature Genetics (of which I am an author) gives some quantitative answers to both these questions. By using diploid de novo assembly of multiple individuals simultaneously, and calling variants directly between samples without any regard to the reference (and then validating by demanding nucleotide-level agreement with finished fosmids), we show that there is a rich seam of variation which is invisible to reference-based approaches.

Re: how much are we missing because we use a reference genome? This breaks into two parts:

1. We might miss something because the reference is incomplete (an assembly error), which is not such a big problem in human genetics as it is for other species, OR because no single haploid genome can contain all the variation in a diverse population.  By assembling over 150 individuals from the 1000 Genomes pilot into 3 "population graphs", we not only found over 3Mb of sequence that is extremely diverged from the reference (effectively missing, as no mapper will know what to do with it), but also showed that there is a significant amount of gene sequence within it, some of which is highly differentiated between populations (see Figure 4 in our paper). The 3Mb is not supposed to be an accurate estimate of the total missing sequence - this was based on old 1000 Genomes data (36bp reads), so the power is low. However, there clearly IS a significant amount of variation out there which reference-based methods are missing.

2. We might miss something because it is hard for a mapper to place reads correctly, or for a mapping-based caller to interpret even correctly placed reads.  This is indeed a problem; for example, it is now common practice to discard SNP calls near indels. If you take a look at Figure 3 in our paper, you see a very striking comparison of the 1000 Genomes calls on a high-coverage sample versus our assembly-based calls (fosmid validation showed a low FDR (~3%)). This shows how assembly is much more powerful at discovering the larger/more complex indels. There is a wide range of calls which new methods are now opening up to us, including very complex combinations of SNPs and indels (see Figure 3b for an example).
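
As an illustration of the "discard SNP calls near indels" practice mentioned above, here is a minimal Python sketch. It is not the filter used by 1000 Genomes or by any particular caller: the tuple layout, the function names and the 10bp window are all assumptions made for the example, and it measures distance from the indel start only, ignoring the indel's length.

```python
from bisect import bisect_left


def is_indel(ref, alt):
    """Treat any call whose ref and alt alleles differ in length as an indel."""
    return len(ref) != len(alt)


def split_snps_near_indels(variants, window=10):
    """Partition SNP calls into (kept, discarded).

    variants: list of (chrom, pos, ref, alt) tuples (hypothetical layout).
    A SNP is discarded if an indel starts within `window` bases of it
    on the same chromosome. Indels themselves are not returned.
    """
    # Collect sorted indel start positions per chromosome.
    indel_pos = {}
    for chrom, pos, ref, alt in variants:
        if is_indel(ref, alt):
            indel_pos.setdefault(chrom, []).append(pos)
    for positions in indel_pos.values():
        positions.sort()

    kept, discarded = [], []
    for variant in variants:
        chrom, pos, ref, alt = variant
        if is_indel(ref, alt):
            continue
        positions = indel_pos.get(chrom, [])
        # First indel at or after (pos - window); near if it is also <= pos + window.
        i = bisect_left(positions, pos - window)
        near = i < len(positions) and positions[i] <= pos + window
        (discarded if near else kept).append(variant)
    return kept, discarded


calls = [
    ("chr1", 100, "A", "G"),   # SNP 5bp from the deletion below
    ("chr1", 105, "CT", "C"),  # 1bp deletion
    ("chr1", 500, "G", "T"),   # isolated SNP
    ("chr2", 105, "T", "C"),   # same position, different chromosome
]
kept, discarded = split_snps_near_indels(calls)
print(len(kept), len(discarded))  # 2 1
```

The discarded list is exactly the class of calls a mapping-based pipeline throws away, so even real SNPs that happen to sit next to an indel are lost.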

I think my take-home message is that as we get more ambitious and want to dig more deeply into human variants, we need to start using unbiased methods of discovery, such as de novo assembly, which are able to capture a much wider range of variants. We're coming into an exciting time where software and statistical methods are catching up with the advances offered by sequencing technologies! For example, an obvious application of being able to assemble and compare multiple samples at the same time is in looking at trios where the child has a phenotype that the parents do not. One can simply pull out of the assembly those sections where the child differs from both parents.
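
The trio idea can be sketched in a few lines of Python. This is a toy illustration of the "colored" k-mer concept only, not how Cortex is implemented: the function names and sequences are made up, each sample is given as error-free reads, reverse complements are ignored, and no graph traversal or bubble calling is done.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_colored_kmers(samples, k):
    """Map each k-mer to the set of sample names ("colors") that contain it.

    samples: dict of sample name -> list of read strings.
    """
    colors = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                colors[kmer].add(name)
    return colors


def private_kmers(colors, sample):
    """Return the sorted k-mers seen in `sample` and in no other sample."""
    return sorted(kmer for kmer, names in colors.items() if names == {sample})


# Hypothetical trio: the child carries a single A>T change absent from both parents.
trio = {
    "mother": ["ACGTACGGA"],
    "father": ["ACGTACGGA"],
    "child": ["ACGTTCGGA"],
}
colors = build_colored_kmers(trio, k=4)
print(private_kmers(colors, "child"))  # ['CGTT', 'GTTC', 'TCGG', 'TTCG']
```

The four child-only k-mers are exactly those overlapping the changed base, and no reference genome was consulted to find them. With real reads, sequencing errors would also produce sample-private k-mers, so some coverage-based cleaning would be needed before a comparison like this means anything.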

Anyway - I think you raised a timely question, and that much will happen in this area this year!

You can access our paper here:

Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics (2012)

http://dx.doi.org/10.1038/ng.1028

and our variant calling software here

http://cortexassembler.sourceforge.net/index_cortex_var.html

regards


Zamin Iqbal

-----------------------------

www.well.ox.ac.uk/~zam


Jeffrey Rosenfeld
Posted: Monday, February 6, 2012 2:40 PM
Joined: 10/30/2011
Posts: 4


Hi Simona,

That is a great question, but I don't know how well we can answer it.  There are two issues:

1. What is meant by "correct"? Each of these types of sequence varies greatly between people and there is no single correct answer.  Even for the reference individual, there are probably problems with the estimates of their repeat sequences.

2. The alignment of NGS reads is generally dependent upon unique sequence in the genome, so it is very hard to align reads to these repetitive locations.  Therefore most re-sequencing projects are probably not calling many variations in these repeats.
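
A rough way to see the uniqueness problem is to ask what fraction of read-length windows in a sequence occur exactly once. The Python sketch below is a toy under stated assumptions: the function name and sequences are invented, only exact forward-strand matches are considered, and real aligners tolerate mismatches and use paired-end information.

```python
from collections import Counter


def unique_fraction(reference, read_len):
    """Fraction of read-length windows that occur exactly once in reference."""
    windows = [reference[i:i + read_len]
               for i in range(len(reference) - read_len + 1)]
    counts = Counter(windows)
    unique = sum(1 for window in windows if counts[window] == 1)
    return unique / len(windows)


# Hypothetical sequence: 4 unique bases followed by a low-complexity run.
toy = "ACGTAAAAAA"
print(unique_fraction(toy, 3))  # 0.5: windows inside the run are ambiguous
print(unique_fraction(toy, 7))  # 1.0: longer windows reach unique flanking sequence
```

Short reads that fall entirely inside a repeat match several places equally well, which is why variants in LINEs, HERVs and Alus tend to go uncalled, and why longer reads recover some of them.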

Jeff


Jeffrey Rosenfeld
Posted: Monday, February 6, 2012 2:49 PM
Joined: 10/30/2011
Posts: 4


Hi Zamin,

Thank you for your very informative post.  I have not read your paper yet, but it looks very interesting.  I assume that the tool you developed is fairly computationally intensive and cannot easily be integrated into common analysis pipelines.  What do you think is the best option for generally refining the variant calling?  One attractive option is the way that Complete Genomics performs a local de novo assembly around potentially complex regions and then calls variants.  But, the CG approach does not address the case of sequences that lack any match in the reference.

Jeff


Zamin Iqbal
Posted: Tuesday, February 7, 2012 4:48 AM
Joined: 1/13/2012
Posts: 5


Hi Jeff

> I assume that the tool you developed is fairly computationally intensive and cannot easily be integrated into common analysis pipelines.

Actually that's not at all the case. My variant assembler (called Cortex) is in some sense less computationally demanding than any other assembler (it depends a bit on how you measure it). It's been used by collaborators for over a year now, and it turns out not to be too hard for others to integrate it into their pipelines. Basically, Cortex allows you to take the output of sequencing machines and get a VCF, which can immediately fit into standard analyses.

For people studying microbes, you can analyse thousands of isolates simultaneously. For those doing large eukaryotes like humans it's possible to study families.

It's a bit like petrol (gas) consumption in cars. You make a car more efficient and people start driving further. If you want to compare tumour-normal or parents and children, then you can now assemble them simultaneously - this is a completely new step for assembly, where so far people have been struggling to do a single haploid assembly, not multiple diploid ones. 

> What do you think is the best option for generally refining the variant calling?

It depends what you want and where you want to be on the sensitivity/specificity curve, but I think assembly has to be part of the solution. Mapping will never get you predictable, reliable results in complex regions.

> One attractive option is the way that Complete Genomics performs a local de novo assembly around potentially complex regions and then calls variants.

Yes, this is a viable approach - the risk is of course that your assembly is affected by bad choices by the mapper. No one has yet done a good study (as far as I am aware) doing this and comparing with the variant calls you get by global assembly, to show what you gain and what you don't. So I just don't know how well it works compared with going the whole hog and doing the whole genome like I do. 

> But, the CG approach does not address the case of sequences that lack any match in the reference.

Nor does it (as far as I can tell) attempt to call the complex variants on the scale that you can by doing whole genome variant assembly. However I haven't done a thorough analysis of the CG results. 

I do think with all of this stuff, as you move into calling harder variants, you need to do really thorough validation, demanding base-level agreement and precise coordinates. I know it's expensive, but fully sequenced fosmid validation is extremely valuable. That's why the fosmids Evan Eichler's team produced for NA12878 are so valuable. Given these resources have been made available, I think it's up to writers of variant callers like me to use them to really prove the accuracy of their calls.


Zamin Iqbal
Posted: Tuesday, February 7, 2012 5:22 AM
Joined: 1/13/2012
Posts: 5


I should add - I'm not claiming to be able to assemble all the crazy things that happen in tumours (yet). But I am saying that by jointly assembling tumour and normal, one can just ignore all the parts of the assembly where the data looks the same, and just focus directly on the differences.


Jeffrey Rosenfeld
Posted: Thursday, February 9, 2012 12:58 PM
Joined: 10/30/2011
Posts: 4


I have some experience with Complete Genomics, and they seem to be the best at dealing with complex variants in the genome.  An example is a case where there is a SNP on one strand, and an indel on the other strand.  CG will take all of the reads overlapping the region and do a de novo assembly of them.  This allows them to properly call the variants.  I am not aware of other variant callers that take this into account.  Most variant callers just call indels and SNPs and don't consider the interactions.  Another area where the local assembly appears to help is when there are multiple nearby SNPs.

Jeff


Zamin Iqbal
Posted: Friday, February 10, 2012 7:08 PM
Joined: 1/13/2012
Posts: 5


Yes, we're talking about calling exactly the same events, in exactly the same way!


Zamin Iqbal
Posted: Friday, February 10, 2012 7:09 PM
Joined: 1/13/2012
Posts: 5


..well, almost exactly
 