By Francisco M. De La Vega
After months of anxious waiting, I finally received my exome data from the 23andMe Exome Pilot about a week ago. When I enrolled last year -- 23andMe offered a $999 personal exome to existing subscribers -- I was quite excited to join the ranks of those who have had their genome sequenced, or at least to be armed with my exome, the 1% of the genome that encodes proteins. With the 30,000+ genomes reportedly sequenced to date, and probably even more exomes, this is nowadays a dubious distinction. However, I wanted to do this on my own terms -- not held hostage to research protocols that usually forbid research subjects from receiving their results. And I was really curious to see if there is anything medically relevant to learn from such data given that I am more or less “healthy.” Well, the moment finally came.
The file download took quite some time -- it was big. Raw reads were included; at a later date I will align them and make my own variant “calls” to see if there are discrepancies. Fortunately, I had already installed the tool that 23andMe requires to unpack and mount the file, and I quickly browsed through the report provided. This short report showed just a few examples of interesting variants, hardly a satisfying result. There is no deception here: these were just the expectations set by 23andMe for this pilot from the beginning.
Knowing this, I had prepared to take matters into my own hands, so I was ready for the next step. For this analysis, I had previously secured a beta testing account for the genome/exome analysis tool that Omicia, a genome interpretation startup from Emeryville, CA, has been developing and testing in recent months (disclosure: I advise Omicia). Using this tool, I was already familiar with the genomes of other people -- the famous first ten genome pioneers(1), the Complete Genomics diversity set and other genomes from ancient individuals that I was analyzing for a joint research project. However, this time it was my own exome! The anticipation was building as I logged into the Omicia beta test server and tried to speed through the variant submission process.
I located the file supplied by 23andMe containing my genetic variants (the VCF file), selected it in the genome upload dialogue and, in a minute, my file was accepted and submitted to the Omicia annotation pipeline. After just 30 minutes, I was ready to explore my exome using Omicia’s "variant mining" report. All of my variants were annotated with the following information:
- chromosome location
- the corresponding gene
- depth of coverage and quality of the calls
- the class of variant as it relates to corresponding protein (synonymous, nonsynonymous, splice, stop gain/loss, frameshift)
- the change in amino acid (if any)
- various scores attempting to evaluate the damage the variant might cause in the protein product (SIFT, PolyPhen, MutationTaster, PhyloP, and Omicia’s own algorithm)
- allele frequencies around the world (if available), and
- cross-references to entries in databases that collect genotype-phenotype relationships of medical significance (OMIM, HGMD, PharmGKB, GWAS hits).
In total, close to 9,000 of my variants had some protein impact. Of these, more than 500 had some database annotation of medical or functional relevance. How could someone live with that many protein-changing variants?! Well, it is now clear that this is pretty normal for human beings. It looked like it was going to be a big task to figure out which of these really could be relevant for me.
After browsing some of the annotations, a surprise -- I recognized one of the genes right away: ATG16L1. I was homozygous for a disease-associated variant that I helped to discover! (The report even included the link to my own paper(2).) Wow! Even if in principle I should be worried because I carry two copies of a known deleterious variant in that gene, somehow I felt amazed to be reacquainted with this old friend. Besides, this variant is associated with Crohn’s disease, a gastrointestinal disease that I certainly don't have and, at my age, am unlikely ever to develop.
This is not an uncommon situation with variants associated with common/complex disease -- they point to relevant genes and pathways, and may lead to new understanding of the disease or even new therapies, but per se have low predictive power in an individual. Obviously the environment plays a significant role -- perhaps exposure to certain intestinal bugs in my childhood helped me? But can this explain some of the problems one of my children has experienced? I am also homozygous for a rare, probably deleterious variant in NOD2, a gene also associated with the disease. Could this be a double hit? Maybe. However, these are just a couple of about 75 genes associated reproducibly with this disease. I created a list of these genes from a recent review, saved it as a personal gene set, and filtered my data again. I have 27 variants in these genes, 3 of which seem severe. Now what? This is how my exploration began. I won't forget that day -- I was really excited.
To focus my search, I first looked for homozygous variants previously associated with disease (as having two copies of a bad variant is more likely to result in disease). This was easily accomplished by selecting the homozygous genotype in the filters and restricting to evidence from the Online Mendelian Inheritance in Man database (OMIM). The quality of the genotype call is important, because if the depth of coverage is low (e.g. 6x), chances are that a homozygote call could in reality be a heterozygote whose second allele was just not sampled during the random shotgun sequencing.
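The coverage argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is only an idealization, not how any particular variant caller works: it assumes each read at a truly heterozygous site shows either allele with probability 0.5, independently, and it ignores sequencing error and mapping bias. The function name is made up for this illustration.

```python
def prob_het_looks_homozygous(depth: int) -> float:
    """Chance that all `depth` reads at a true heterozygous site show the same allele.

    Idealized assumption: each read samples either allele with probability 0.5,
    independently, with no sequencing error or mapping bias.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Either every read carries the first allele or every read carries the second.
    return 2 * 0.5 ** depth


for depth in (6, 15, 30):
    print(f"{depth}x: {prob_het_looks_homozygous(depth):.6%}")
```

Under these assumptions, about 3% of heterozygous sites covered at 6x would show only one allele, against roughly 0.006% at 15x, which is why a stricter depth cutoff makes homozygous calls more trustworthy.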
Filtering by coverage (>15x) was quite easy to do, although interestingly it removed 33% of the variants. This analysis yielded some interesting hits, but (luckily) nothing really worrying -- just some susceptibility variants. One was a variant in FAAH associated with drug addiction. It exhibited a large odds ratio for drug addiction susceptibility in some small studies (so confidence is not very high), but thankfully it may not be associated with alcohol abuse and thus my occasional wine consumption is hopefully innocuous.
Next, I looked for heterozygous variants in these genes, which could mean I am a carrier of the disease. (I relaxed the minimum coverage to 6x to recover more variants.) I found the same carrier variant that 23andMe previously reported from my genotypes, but now I could see a few more autosomal recessives. This was interesting, but nothing really bad appeared. One finding was a variant in the Factor 9 gene that has been associated with reduced risk for deep vein thrombosis -- good news perhaps for a frequent flier. Other variants were just phenotypic polymorphisms (e.g. hair thickness, eye and skin color), but other disease-associated variants are difficult to interpret, as at times they increase risk and at other times they decrease it. How do you add this all up?
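For readers who prefer to work on the VCF file directly, the two genotype filters described above (homozygous calls with deeper coverage, heterozygous calls with a relaxed cutoff) could be sketched as follows. This is a minimal sketch, not the Omicia tool's implementation. It assumes a single-sample VCF whose FORMAT column includes GT and DP, and it treats the cutoffs as "at least this depth". The `Call`, `parse_calls` and `select` names and the example records are invented for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Call(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    zygosity: str  # "hom_alt", "het", or "other"
    depth: int


def parse_calls(vcf_lines: Iterable[str]) -> Iterator[Call]:
    """Yield one Call per data line of a single-sample VCF (assumes GT and DP in FORMAT)."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Pair FORMAT keys with the sample's values, e.g. {"GT": "1/1", "DP": "22"}.
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        alleles = sample["GT"].replace("|", "/").split("/")
        if len(alleles) == 2 and "." not in alleles:
            if alleles[0] != alleles[1]:
                zygosity = "het"
            elif alleles[0] != "0":
                zygosity = "hom_alt"
            else:
                zygosity = "other"  # homozygous reference
        else:
            zygosity = "other"  # missing or non-diploid genotype
        depth_text = sample.get("DP", ".")
        depth = int(depth_text) if depth_text != "." else 0
        yield Call(chrom, pos, ref, alt, zygosity, depth)


def select(calls: Iterable[Call], zygosity: str, min_depth: int) -> List[Call]:
    """Keep calls of the requested zygosity covered by at least min_depth reads."""
    return [c for c in calls if c.zygosity == zygosity and c.depth >= min_depth]


# Made-up records for illustration only; positions and alleles are placeholders.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t1/1:22",
    "chr1\t2000\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:6",
    "chr2\t3000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:8",
]
calls = list(parse_calls(example))
confident_homozygotes = select(calls, "hom_alt", min_depth=15)
carrier_candidates = select(calls, "het", min_depth=6)
print(confident_homozygotes)
print(carrier_candidates)
```

A real analysis would more likely rely on an established VCF library and would also check the genotype quality, which this sketch leaves out.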
Top Ten Lists
Omicia also allows filtering variants by collections of genes curated by their association with disease. One of these collections is the "top 10" list, which includes the most actionable genes for a given disease or area (e.g. Alzheimer's, cancer, Parkinson’s, cardiology, epilepsy, psychiatric, aging, respiratory), selected by a panel of experts. I mostly focused on variants with a high probability of being damaging, using the impact scores. Most disease areas came up empty, with a few exceptions that were heterozygous and thus less worrisome. One of these exceptions was a likely deleterious variant in A2M, a gene involved in Alzheimer’s disease. Evidence suggests this gene is involved in the clearance of components of the beta-amyloid deposits, which could make the disease advance faster, but there are no reports linking this to the onset of the disease.
Some advice here: beware of large genes such as BRCA1 or HTT (the Huntington’s gene), which frequently carry benign heterozygous variation that has not been associated with disease. When I looked at HTT, I was briefly concerned, but the variant I carry was not the repeat expansion associated with Huntington’s disease (which would be very difficult to identify with the current technology), so this is likely just a benign variant. In addition, I harbor some heterozygous benign variants in BRCA1, but these are common in the population.
There is also a set of highly polymorphic genes that often show damaging variants, such as olfactory receptors, mucins, etc. These are rarely interesting, and filtering them out is easily done. Also beware of disease-associated variants in the homozygous state where the alternative allele (different from the reference) is the common one in the population -- this just means that the reference genome assembly carries the rare or disease-associated allele, and it illustrates a problem with the standard analysis, where only discrepancies from the reference are reported as variants. You can filter those out by excluding variants with alternate allele frequency between 0.5 and 1.
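The last filter is simple enough to sketch in a few lines. The record layout here is purely hypothetical (a dict with an "alt_af" key holding the alternate allele frequency, or None when unknown), and keeping a frequency of exactly 0.5 is a choice made for this sketch, since the boundary is not specified.

```python
def drop_reference_minor_alleles(variants):
    """Keep variants whose alternate allele is not the common one in the population.

    Each variant is a dict with a hypothetical "alt_af" key: the alternate
    allele frequency (0.0 to 1.0), or None when no frequency is known.
    """
    kept = []
    for variant in variants:
        alt_af = variant.get("alt_af")
        # Unknown frequencies are kept; only a known frequency above 0.5 is dropped.
        if alt_af is None or alt_af <= 0.5:
            kept.append(variant)
    return kept


# Made-up records for illustration only.
variants = [
    {"id": "var_a", "alt_af": 0.92},  # reference carries the rare allele: drop
    {"id": "var_b", "alt_af": 0.03},  # genuinely rare alternate allele: keep
    {"id": "var_c", "alt_af": None},  # no frequency data: keep
]
kept_ids = [v["id"] for v in drop_reference_minor_alleles(variants)]
print(kept_ids)
```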
Next, I wanted to see if rarer, novel variants that I inherited(3) could damage genes of medical relevance. Omicia has curated sets of genes linked by the literature to medical specialties, as in the index of the Harrison Textbook of Internal Medicine. This is a much bigger list of genes, so I again restricted my search to likely deleterious variants, and filtered by allele frequency to variants with a frequency below 5% in global populations (those without frequency information are treated as zero). Here, many more genes surface, spanning a variety of phenotypes. The big challenge here is how to evaluate new variants that are not specifically associated with disease but could be more damaging to medically important genes. This is where I will spend much more time, chasing these variants while awaiting new methods to deal with this conundrum.
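As a rough sketch of this kind of query, the function below keeps variants that fall in a chosen gene set, have a frequency below 5% (with missing frequencies treated as zero, as described above) and pass a damage-score cutoff. The field names, the single combined "damage_score" and its 0.8 threshold are all assumptions made for illustration; they do not correspond to Omicia's scoring or to any specific predictor.

```python
def rare_damaging_in_gene_set(variants, gene_set, max_af=0.05, min_score=0.8):
    """Select rare, likely damaging variants that fall in the given gene set.

    Hypothetical record layout: dicts with "gene", "alt_af" (float or None)
    and "damage_score" (0.0 to 1.0, higher meaning more likely damaging).
    """
    selected = []
    for variant in variants:
        alt_af = variant.get("alt_af")
        if alt_af is None:
            alt_af = 0.0  # no frequency information: treat as never observed
        in_set = variant["gene"] in gene_set
        if in_set and alt_af < max_af and variant["damage_score"] >= min_score:
            selected.append(variant)
    return selected


# Placeholder gene names and values, for illustration only.
gene_set = {"GENE_A", "GENE_B"}
variants = [
    {"gene": "GENE_A", "alt_af": None, "damage_score": 0.95},  # novel, damaging: keep
    {"gene": "GENE_A", "alt_af": 0.20, "damage_score": 0.90},  # too common: drop
    {"gene": "GENE_B", "alt_af": 0.01, "damage_score": 0.30},  # likely benign: drop
    {"gene": "GENE_C", "alt_af": 0.01, "damage_score": 0.99},  # not in the set: drop
]
hits = rare_damaging_in_gene_set(variants, gene_set)
print(hits)
```

Treating missing frequencies as zero keeps truly novel variants in the results, but it also lets through variants that are simply absent from the frequency databases consulted.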
Finally, a dream comes true. In my early days as a geneticist I experimented with mutations in the genome of the bacterium E. coli that introduce a premature termination signal into a gene in place of an amino acid, resulting in truncated proteins when the gene is translated. This often results in the loss of a critical function, for example one needed by a bacteriophage to infect the bacterium and grow. These mutations were so special they carried almost mystical names: amber, opal, ochre(4). Since then, I have wondered when I could do the first real experiment on my genome: Do I carry any “loss of function” mutations in any of my 20,000 genes?
This list must be short. Daniel MacArthur and collaborators from the 1000 Genomes Project recently published a nice paper in Science about the abundance of LoF variants among human populations(5), and here I am, just weeks later, doing the same analysis in my exome in a matter of minutes. How cool is that?!
In total my exome carried 41 stop-gained variants, 5 of them homozygous. This revealed two additional potential carrier mutations -- in each, the stop lay ahead of mutations previously reported in rare autosomal recessive diseases. I eagerly followed the direct links to the primary literature provided by Omicia. At first sight, there is probably nothing to worry about for me -- but this is information worth sharing with my children.
An additional complexity in this analysis is that I am of mixed heritage, with ancestors originating in different continental populations that were isolated for a long time and mixed relatively recently. What is the effect of this mixed ancestry on my susceptibility to disease? Most disease studies have been carried out in populations of European descent. Therefore, there may be novel susceptibility or protective variants in other continental populations that went through different population bottlenecks, expansions, and perhaps even adaptations to new environments. A rare variant in Europeans could be common in other continents due to genetic drift. And if rare variants in my immediate "clan" have a stronger influence on my health, these are less likely to be shared between populations(6). Again, more data is needed, and I just hope that “GWAS fatigue” doesn’t kill the studies in other continental populations that would help illuminate the non-European part of my genome, as has been suggested by Carlos Bustamante, Esteban Burchard and myself(7).
At this point, the analysis of my exome seems a bit boring, as it appears I have a relatively healthy genome -- a handful of complex disease-associated variants for diseases that so far I don’t suffer from, and new personal variants of unknown significance. And I could discover that in a few hours doing my own “genome project” -- this is the closest thing to instant gratification in genomics. More work is needed to assess the impact of my novel variants, and to combine these with the multiple disease-associated susceptibility alleles into scores that predict whether I carry an extra genetic “load” for a given disease.
While I try to figure all this out, and await my full genome sequencing to find out whether I carry regulatory or structural variants of consequence, I think I will follow the wise advice from my wife -- eat better, do more exercise, and enjoy life while you can.
Francisco M. De La Vega, D.Sc., is a Visiting Instructor at the Department of Genetics, Stanford University School of Medicine. A former Distinguished Scientific Fellow in Genetics and Computational Biology at Applied Biosystems, Francisco consults with various biotechnology companies, most recently with Omicia.
1. B. Moore et al., Global analysis of disease-related DNA sequence variation in 10 healthy individuals: Implications for whole genome-based clinical diagnostics, Genetics in Medicine 13, 210–217 (2011).
2. J. Hampe et al., A genome-wide association scan of nonsynonymous SNPs identifies a susceptibility variant for Crohn disease in ATG16L1, Nat. Genet. 39, 207–211 (2007).
3. J. R. Lupski, J. W. Belmont, E. Boerwinkle, R. A. Gibbs, Clan Genomics and the Complex Architecture of Human Disease, Cell 147, 32–43 (2011).
4. F. W. Stahl, The amber mutants of phage T4, Genetics 141, 439–442 (1995).
5. D. G. MacArthur et al., A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes, Science 335, 823–828 (2012).
6. A. Keinan, A. G. Clark, Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants, Science 336, 740–743 (2012).
7. C. D. Bustamante, E. G. Burchard, F. M. De La Vega, Genomics for the world, Nature 475, 163–165 (2011).