By Jeffrey Rosenfeld, PhD
In the past few years, the price of sequencing has plummeted, and the complete sequence of an individual can now be obtained for a few thousand dollars. Even so, many scientists have opted to sequence just the exome (the coding regions) of an individual and to ignore the rest of the genome. This focus on the exome has some justification, but I think that it is shortsighted: despite the higher cost, sequencing a complete genome is more valuable, even if that means sequencing fewer samples.
The supporters of exome sequencing generally make the following points to justify their position:
A. The sequencing of an exome is much cheaper than the sequencing of a genome. It must be substantially cheaper to sequence 1% of the genome than the whole genome.
B. We don’t understand how to interpret non-coding variants and therefore we should limit our sequencing to genes that are well annotated.
C. Variants that are associated with a genetic disease are more likely to be found in a coding region since they directly alter the structure of a protein.
I am not going to deny that there is some validity to these points, but I don’t think that they outweigh the shortcomings of exome sequencing and the benefits of whole genome sequencing that I will outline below. I understand that this is a contentious issue, and I welcome your comments whether you agree or disagree with my position.
1. Cost
The first reason that people generally look to exome sequencing is cost. Intuitively, sequencing 1% of the genome (the exome) should be cheaper than sequencing the entire genome. While this is true, the price differential is nowhere near 1:100 and is closer to 1:2 or 1:3, depending upon how the costs of the sequencing are calculated. Currently, a whole genome costs ~$4,000 and an exome costs ~$1,500. Why are these prices so close to each other? The answer is that the reagent cost of running the sequencer is not the only factor in the cost of a genome or an exome. For either type of experiment, library prep is required, along with the costs associated with setting up a sequencing run of any size. For an exome, there is the additional cost of purchasing the selection kit, which allows one to extract the coding sequences from raw DNA either using a microarray or in solution. This kit can cost several hundred dollars, and is therefore a substantial portion of the cost of exome sequencing.
Because the cost differential is so modest, the economic argument for favoring exome sequencing is not very strong. For the same amount of funding, a researcher would need to choose between, say, 30 exomes and 10 genomes. While 30 samples are obviously better than 10, this is not a great differential. It is much less than the 100:1 differential that one would naively expect from the relative sizes of the exome and the genome. An additional factor affecting exome sequencing is the time required to perform the hybridization. The Nimblegen protocol requires 72 hours for hybridization, and the Agilent approach requires 24 hours. These times add a delay between sample and sequence, which may be problematic for clinical applications. As an example, the Ion Torrent machine is being pitched as a tool for rapid sequencing that will produce results in a single day. When an exome is targeted using Agilent or Nimblegen, this will grow to at least 2 or 4 days, respectively.
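The arithmetic behind this section can be sketched in a few lines of Python. This is only an illustration: the per-sample prices and hybridization times are the approximate figures quoted above, while the $40,000 budget and the function names are hypothetical choices made for the example.

```python
GENOME_COST = 4000  # approximate whole-genome price quoted above (USD)
EXOME_COST = 1500   # approximate exome price quoted above (USD)


def samples_for_budget(budget, cost_per_sample):
    """Number of whole samples that fit within a fixed budget."""
    return int(budget // cost_per_sample)


def turnaround_days(hybridization_hours, sequencing_days=1):
    """Sample-to-sequence time, assuming a one-day sequencing run."""
    return sequencing_days + hybridization_hours / 24


budget = 40_000  # hypothetical budget, chosen so that it buys 10 genomes
genomes = samples_for_budget(budget, GENOME_COST)
exomes = samples_for_budget(budget, EXOME_COST)
price_ratio = GENOME_COST / EXOME_COST

print(f"{genomes} genomes or {exomes} exomes")
print(f"a genome costs {price_ratio:.1f}x as much as an exome, not 100x")
print(f"Agilent: {turnaround_days(24):.0f} days, Nimblegen: {turnaround_days(72):.0f} days")
```

With the quoted prices, the exome count for this budget comes out somewhat below the round figure of 30 used in the text, which only strengthens the point that the gap is far smaller than 100:1.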
2. Exome coverage
The definition of an exome is somewhat elusive. It can refer to:
a) All of the coding exons of the genome
b) A + microRNA genes
c) A + 5’ UTR and 3’ UTR regions
d) Unannotated transcripts that have been discovered in RNA-seq experiments or from the ENCODE project
e) All "functional" portions of the genome
These five definitions include very different portions of the genome, and some of them, such as (e), are difficult to define in and of themselves. Multiple studies have shown that there is pervasive transcription along substantial portions of the genome. Should all of these regions be considered part of the exome? In general, they are not included in the exome kits, since their inclusion would push the size of the exome much closer to that of a genome, and any potential savings from the smaller amount of sequencing would shrink. Instead, the exome is generally limited to coding genes with some level of annotation, along with microRNAs and, to some extent, UTRs.
Each of the vendors that produce exome kits has taken a different approach to defining the exome. A recent paper (http://www.nature.com/nbt/journal/v29/n10/full/nbt.1975.html) compared the exome selection offerings from the three main players in the field: Agilent, Nimblegen and Illumina.
This figure from the paper gives a great comparison of the different technologies. Firstly, the approaches to selecting the exome sequence differ. Nimblegen uses overlapping DNA baits, Agilent uses RNA baits that are distinct but contiguous, and Illumina uses distinct DNA baits that are not contiguous and leave gaps of untargeted sequence between them. Because of this, Nimblegen uses many times the number of probes of the other two technologies. The rest of the figure shows Venn diagrams illustrating the overlap between the targeted regions. For two different definitions of human genes, RefSeq and Ensembl, there is substantial agreement between the technologies, as indicated by the 28.5 and 28.4 Mb of sequence that they all cover. The biggest discrepancy is with regard to UTR regions, where Illumina has 28 Mb that are missing from the other two platforms.
A different technique to assess coverage is to look at the fraction of the exome target from a particular kit that is covered at a sufficient depth to make a confident variant call. For many scientists, a threshold of 20x coverage is required to trust a variant derived from an exome sequence. Any loci with less coverage are ignored. Since the typical mean sequencing coverage for an exome is 80x, in theory it should be no problem to achieve 20x coverage of the entire targeted region. In practice, this is not the case, for three reasons. Firstly, exome sequencing, as with all sequencing, produces reads in a statistical distribution and not evenly along the genome. By chance, some regions are going to have their DNA sequenced more often and thus have a higher number of reads. This idea forms the basis of the famous Lander-Waterman statistics that are used for designing sequencing projects. The second reason for variation in coverage is that some of the baits used for selecting the exomic DNA will have a higher affinity than other baits, mainly due to GC content. Those probes with higher affinity for their targets will produce greater amounts of sequenced DNA. The final concern is due to the repetitive nature of the genome. The selection probes need to target a unique location in the genome to ensure that they are truly obtaining the DNA that they intend to select. If the targeted region is repeated in the genome, then sequence from all of the matching regions will be equally selected. Many human genes share domains with other genes, and any shared sequences cannot be targeted. This problem is equivalent to the unique mapping of sequencing reads, which is a big concern in the use of short sequence reads. Any reads that map to more than one location in the genome cannot be uniquely placed and are generally discarded.
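To get a feel for the first of these three effects, here is a minimal sketch that treats the read depth at each base as a Poisson random variable, the kind of idealization that underlies Lander-Waterman-style calculations. The function names are hypothetical, and the model deliberately ignores bait affinity and repeats. Real data are usually more dispersed than a Poisson model predicts, so this is best read as the most optimistic case.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        total += term
    return total


def fraction_at_or_above(min_depth, mean_depth):
    """Expected fraction of bases covered by at least min_depth reads."""
    return 1.0 - poisson_cdf(min_depth - 1, mean_depth)


for mean_depth in (30, 80):
    frac = fraction_at_or_above(20, mean_depth)
    print(f"mean {mean_depth}x: {frac:.4%} of bases expected at >= 20x")
```

Under these idealized assumptions, essentially every base reaches 20x at a mean depth of 80x, and only a few percent fall short at a mean of 30x. Any larger shortfall seen in a real exome would therefore have to come from sources beyond random sampling, such as the capture-related effects described above.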
These concerns are illustrated in this figure from Agilent regarding their SureSelect sequencing:
This is an old figure, but I think that while the numbers might have changed a bit, the overall message remains. The read depth is extremely variable, and you do not achieve anything close to 100% coverage of the exome. While accurate data are available for 80% of the exome (depth > 20x), this also means that 20% of the exome is missed. For a disease study where an exomic variant correlates with the disease, this means that there is a 1-in-5 chance of not having the variant included in the data. A researcher could conclude that there is no coding variant associated with their disorder when, in actuality, the variant simply fell into the 20% that was missed. An error level of 20% is not trivial and cannot be lightly dismissed.
3. Whole Genomes
When a whole genome is sequenced, many of the issues regarding exome sequencing are not relevant. There is no need to buy a hybridization kit or to wait for the kit to hybridize. While there are sequencing biases (as there are in any sequencing experiment), there are not the additional biases introduced from the exome selection. Overall, there is probably the standard 5% error in sequencing giving a confidence level of 95%.
But the biggest gain from whole-genome sequencing is that the entire genome (excluding some unclonable regions) is obtained. If you want to focus on the exome because it is easier to understand and interpret, you can easily filter out the non-coding portions of the genome to obtain an in silico exome. This is an easy action to perform, and if a positive result is not found in the exome, then you already have the rest of the genome sequenced and can begin looking for an intronic variant related to splicing, or a non-coding promoter or enhancer variant. In a traditional exome experiment, this is not possible. If no variant is found in the exome, then there is no result, and one needs to go back and sequence the whole genome from scratch.
To give a picture of the fraction of disease-associated variants that are coding or non-coding, I looked at the UCSC collection of GWAS results. The current list contains 5454 unique SNP loci that were identified as part of a GWAS. Of these SNPs, 3047 (56%) are not within coding genes. Thus, more than half of the identified important genomic variants are not in coding regions and would not be covered by exomes. (Some of these SNPs may be in UTRs or non-coding RNAs, which are targeted by some of the platforms.)
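The in silico exome described above amounts to an interval filter over the variant calls. The following is a minimal sketch, not a production pipeline: the coordinates are invented for illustration, the function names are hypothetical, and it assumes 0-based, half-open target intervals (as in BED files) that do not overlap within a chromosome.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(targets):
    """Group (chrom, start, end) target intervals by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end in targets:
        index[chrom].append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_targets(index, chrom, pos):
    """True if a 0-based position falls inside any target interval."""
    intervals = index.get(chrom, [])
    # Find the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    if i < 0:
        return False
    start, end = intervals[i]
    return start <= pos < end


# Hypothetical exome targets and whole-genome variant positions.
targets = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 120)]
variants = [("chr1", 150), ("chr1", 300), ("chr1", 649),
            ("chr1", 650), ("chr2", 10), ("chr3", 75)]

index = build_index(targets)
exome_variants = [v for v in variants if in_targets(index, *v)]
noncoding_variants = [v for v in variants if not in_targets(index, *v)]

print("in silico exome:", exome_variants)
print("held in reserve:", noncoding_variants)
```

The variants outside the targets are not thrown away. They remain available for a second pass if the exome-only analysis comes up empty, which is exactly what an exome capture experiment cannot offer.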
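As a quick check on these proportions, the counts quoted above can be turned into fractions directly. The variable names are just for illustration, and the caveat about UTRs and non-coding RNAs still applies to the split.

```python
total_snps = 5454      # unique GWAS SNP loci in the UCSC list quoted above
noncoding_snps = 3047  # loci not within coding genes

noncoding_fraction = noncoding_snps / total_snps
coding_fraction = 1 - noncoding_fraction

print(f"non-coding: {noncoding_fraction:.1%}")
print(f"coding:     {coding_fraction:.1%}")
```

Rounded to whole percentages, this gives the 56% non-coding share and leaves about 44% of the loci within coding genes.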
I see this as a betting situation. Would you rather spend $1,500 and have a 44% chance of getting the answer, or spend $4,000 and have a 95% chance of getting the answer? I think that the $4,000 genome is much more reasonable. Just because we don’t understand non-coding sequence does not mean that we can or should ignore it. As scientists, we have an obligation to try our best to investigate human disease and not to focus only on things that are easy to understand.
As a final point, there has been some recent talk concerning variants that are found only by exome sequencing and not by genome sequencing. These results are not a fair, apples-to-apples comparison. The exomes are generally sequenced at 80x coverage, and the genomes are sequenced at 30x coverage. For the specific variants under discussion, 80x sequencing coverage is required to identify them with any technique. This 80x coverage could have been of just the exome or of the entire genome. If the whole genome were sequenced to 80x for a true comparison, then I am confident that there would not have been an advantage for the exome over the genome.
Jeffrey Rosenfeld is a Bioinformatics Scientist in the Division of High Performance and Research Computing at the University of Medicine and Dentistry of New Jersey (UMDNJ) and a Research Associate in the Division of Invertebrate Zoology at the American Museum of Natural History.