Progress review: genome sequencing—June 2019
What does it mean to sequence a genome?
A simplistic view is that the genome is just a coil of DNA tape in every cell that can be read out base by base. But in truth, the genome has lots of edge cases. First, the genomes of humans, and eukaryotes in general, are packed into chromosomes. There are
From A reference standard for genome biology (2018):
Reference genomes are the cornerstone of modern genomics. These high-quality genomes are differentiated from draft genomes by their completeness (low number of gaps), low number of errors, and high percentage of sequence assembled into chromosomes. Although the genomes of viruses and some prokaryotes have complete end-to-end sequence information, nearly all eukaryotic genomes do not. Indeed, even the latest (19th) version of the high-quality human reference has hundreds of gaps, mostly in or near centromeres, telomeres, segmental duplications and ribosomal DNA arrays.
So basically, sequencing a genome from scratch is very expensive, so they make one very good “reference genome” for each species, then for any individual, it’s only necessary to sequence bits and pieces, then compare with the reference. It’s like playing a jigsaw puzzle: if you know the whole picture, it’s much easier to piece together the pieces.
Wouldn’t it be funny if, in the future, nanotechnology became good enough to manipulate DNA directly, as fluidly as electronics manipulate the voltage levels of semiconductor chips, so that computation could be run directly on DNA, obviating the laborious process of converting the nucleotide bases into silicon “bases”? … Nah, they’ll never be able to run quicksort on the chromosome without converting it to silicon bits first.
That’s just the genome. There are also the epigenomes, the genomes of mitochondria, the genomes of cancer cells...
For some species, even simple information like the number of expected chromosomes is unknown.
The quest for completion continues. There is still much to do.
Human whole genome sequencing
The cost of sequencing a human genome was plotted in 2014 in Technology: The $1,000 genome (2014), Nature News. Back then, it cost $5000.
In 2015 it dropped suddenly to $1000 and stagnated, according to NHGRI Genome Sequencing Program (June 7, 2019).
BGI claims to sequence a genome for as little as $600, though details are lacking. I’d put that as the lower bound on sequencing cost.
I find this curve interesting, for it is definitely not exponential. NIH explained the abrupt drop in 2008 thus:
beginning in January 2008… the sequencing centers transitioned from Sanger-based (dideoxy chain termination sequencing) to ‘second generation’ (or ‘next-generation’) DNA sequencing technologies.
The drop in the middle of 2015 is noted but not clearly explained:
the cost to generate a high-quality ‘draft’ whole human genome sequence in mid-2015 was just above $4,000; by late in 2015, that figure had fallen below $1,500. The cost to generate a whole-exome sequence was generally below $1,000. Commercial prices for whole-genome and whole-exome sequences have often (but not always) been slightly below these numbers.
Instead, a morass of technical information is given, with no mention of any abrupt technical changes:
In 2015, the most common routine for sequencing an individual’s human genome involves generating a ‘draft’ sequence and comparing it to a reference human genome sequence… nearly all human genome sequencing in 2015 yields high-quality ‘draft’ (but unfinished) sequence. That sequencing is typically targeted to all exons (whole-exome sequencing) or aimed at the entire ~6-billion-base genome (whole-genome sequencing)… The quality of the resulting ‘draft’ sequences is heavily dependent on the amount of average base redundancy provided by the generated data (with higher redundancy costing more).
There is also a warning about comparing price quotes from academic and commercial institutions:
The cost data that NHGRI collects from its funded genome-sequencing groups includes information about a wide range of activities and components, such as: reagents, consumables, [long list] … Note that such cost-accounting does not typically include activities such as quality assurance/quality control (QA/QC), [long list] … Almost certainly, companies vary in terms of which of the items in the above lists get included in any cost estimates, making direct cost comparisons with academic genome-sequencing groups difficult… Anyone comparing costs for genome sequencing should also be aware of the distinction between ‘price’ and ‘cost’ - a given price may be either higher or lower than the actual cost.
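To make the "not exponential" observation concrete, here is a rough sketch that computes the halving time implied by each pair of consecutive price points quoted in this section. The fractional years are my own approximate placement of "mid-2015", "late 2015" and "June 2019", and the dollar figures are rounded from the quotes, so treat the output as illustrative only. A truly exponential curve would give roughly the same halving time for every segment.

```python
import math

# (year, cost in USD) points quoted in this section. The fractional years are
# assumptions: a rough placement of "mid-2015", "late 2015" and "June 2019".
points = [
    (2014.0, 5000),   # Nature News figure for 2014
    (2015.5, 4000),   # NHGRI: "just above $4,000" in mid-2015
    (2015.9, 1500),   # NHGRI: "below $1,500" by late 2015
    (2019.45, 1000),  # roughly where the curve sits at the time of this review
]


def halving_time_years(t0, c0, t1, c1):
    """Years needed to halve the cost if the drop from c0 to c1 were exponential."""
    rate = math.log(c0 / c1) / (t1 - t0)  # continuous decline rate per year
    return math.log(2) / rate


# One implied halving time per pair of consecutive points.
halving_times = [
    halving_time_years(t0, c0, t1, c1)
    for (t0, c0), (t1, c1) in zip(points, points[1:])
]

for (t0, _), (t1, _), h in zip(points, points[1:], halving_times):
    print(f"{t0:.2f} -> {t1:.2f}: implied halving time ~ {h:.1f} years")
```

With these inputs the implied halving time is several years before mid-2015, a few months during the 2015 drop, and several years again afterwards, which is a step followed by a plateau rather than a steady exponential decline.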
The $1000 genome
From Wikipedia:
The “$1,000 genome” catchphrase was first publicly recorded in December 2001 at a scientific retreat to discuss the future of biomedical research following publication of the first draft of the Human Genome Project
The $100 genome
Clearly, if there’s a $1000 genome, there has to be a $100 genome (and so on inexorably).
The $100 genome is still not here yet. The earliest mention I could find is from a 2008 report from MIT Tech Review that quotes two predictions:
$1000 genome hopefully before 2011.
$100 genome after 2013.
The first prediction was wrong. Extrapolating from the graph in 2008, 2011 seemed reasonable, but it turned out there was significant plateauing. The second prediction is right, but not very sharp.
It’s a confusing mess to figure out which of these commercial products is supposed to do what, but it seems that a (mostly complete) personal genome sequence currently costs as little as $600.
Massive Whole-Genome Sequencing (WGS) projects
Humans
There seems to be on the order of 1 million human genomes sequenced so far. Many are national projects.
Genomics England has almost completed 0.1 million genomes of Englanders so far.
UK Biobank started in 2006, and has sequenced 0.5 million (not sure if WGS, or just genome-wide SNPs).
European ‘1+ Million Genomes’ Initiative aims to have 1 million sequenced genomes from the EU by 2022, and as of June 14, 2019, has 21 countries on board.
All Of Us in America aims for 1 million WGS, and GenomeAsia100K aims to sequence 0.1 million Asians.
Some are sub-national, though.
The city of Dubai plans to sequence all its 5 million people.
The city of Nanjing in China plans to sequence 1 million people.
Nonhumans
1001 Genomes project started in 2008, aiming to sequence genomes of strains of Arabidopsis thaliana (the model plant in biology). It concluded with 1135 genomes published in 2016.
B10K project started in 2014, aiming to sequence all 10,560 species of Aves before 2020. So far (July 5, 2017) it has acquired 2500 samples and sequenced just 300. It will surely fail to deliver. Another project, OpenWings, started in April 2018 and aims for the same in 4 years. As of June 2019, the project is alive, but no firm progress has been reported.
Bat 1K began in 2018, and aims to sequence all bats, defined as the 1288 species of Chiroptera. The May 2019 newsletter reports completion of “deep sequencing” (sequencing multiple times to reduce error) of 5 bats and near-completion of a sixth. They have also received funds to sequence one species from each of the 21 bat families.
Going bigger, the Genome 10K project aims to sequence the genome of at least one individual from each vertebrate genus, approximately 10,000 genomes.
It is a main step of The Vertebrate Genomes Project, which aims to generate reference genomes for all 66,000 extant vertebrate species. It made some good progress in September 2018, publishing 15 reference genomes from 14 species. They sequenced the zebra finch (Taeniopygia guttata), the most commonly studied vocal learner, twice (a male and a female), presumably because it is particularly interesting for studying the genetics of language.
The logical conclusion is the Earth BioGenome Project, a project started in November 2018, aiming to sequence all genomes of known eukaryotic species on earth in 10 years:
… sequencing and functionally annotating the genomes of 1.5 million known species of eukaryotes… To date, the genomes of less than 0.2% of eukaryotic species have been sequenced...
The project also seeks to reveal some of the estimated 10 million to 15 million unknown species of eukaryotes, most of which are single cell organisms, insects and small animals in the oceans… Researchers estimate the proposed initiative will take 10 years and cost approximately $4.7 billion… about 1 exabyte of digital storage capacity.
Note: 1 exabyte isn’t that much in terms of REALLY big science. The voracious LHC produces 1 petabyte/sec, too much to record, and so it’s filtered before storage. Even after filtering, it had still archived 200 petabytes as of June 29, 2017.
The project targets 1.5 million known species, so at the current cost of about $1k per human genome, a naive estimate is 1.5 million × $1k = $1.5 billion. The projected cost of $4.7 billion is within an order of magnitude of that, so it passes the Fermi estimate sanity check.
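The Fermi estimate above can be written out as a few lines of arithmetic. This is only a sketch: it assumes every eukaryotic genome costs about as much as a human one, which ignores genome size, sample collection, and annotation, so it only checks the order of magnitude.

```python
# Inputs taken from the post; the per-genome cost is a loud assumption
# (every species priced like one human whole genome).
known_species = 1_500_000        # known eukaryotic species, per the project
cost_per_genome_usd = 1_000      # assumed: same as a human genome in 2019
projected_cost_usd = 4.7e9       # the project's own cost estimate

naive_cost_usd = known_species * cost_per_genome_usd
ratio = projected_cost_usd / naive_cost_usd

print(f"Naive estimate: ${naive_cost_usd / 1e9:.1f} billion")
print(f"Projected / naive: {ratio:.1f}x")
```

The projected figure comes out at roughly three times the naive one, comfortably within the factor of ten that a Fermi estimate is expected to tolerate.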
The project is made of many sub projects. For example:
10KP aims to sequence 10,000 land plants and 4000 protists by 2022, representing every major clade of plants and eukaryotic microbes.
Darwin Tree of Life Project aims to sequence 66,000 UK species by about 2028.
Also microbes
The Earth Microbiome Project studies the microbes of earth:
We use DNA sequencing and mass spectrometry of crowd-sourced samples… set out to analyze 200,000 samples… to produce a global Gene Atlas… approximately 500,000 reconstructed microbial genomes...
It started in 2010. At the end of 2017, it reported 28,000 species sequenced. I could not find a price or completion date estimate. Hopefully it will complete in under 100 years (assuming constant speed)!
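For the constant-speed extrapolation, here is a minimal sketch. It takes the 28,000 figure over roughly 7 years (2010 to 2017) as the rate, and treats "species", "samples", and "genomes" as interchangeable units, which they are not; both are simplifying assumptions of mine, and the two targets are simply the two numbers quoted from the project above.

```python
# Assumptions: progress is linear, the project has run about 7 years,
# and the reported count is comparable to the quoted targets.
start_year = 2010
report_year = 2017
sequenced_so_far = 28_000

rate_per_year = sequenced_so_far / (report_year - start_year)


def total_years(target):
    """Total project duration in years at constant speed for a given target count."""
    return target / rate_per_year


for label, target in [("200,000 samples", 200_000), ("500,000 genomes", 500_000)]:
    years = total_years(target)
    print(f"{label}: ~{years:.0f} years total, finishing around {start_year + years:.0f}")
```

Depending on which target is used, the constant-speed estimate lands somewhere between about half a century and somewhat over a century, so "100 years" is the right ballpark for a pessimistic guess.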
I think the main issue is retail vs wholesale. (BGI’s $600 probably means volume pricing.) Also, I think NHGRI publishes averages over deployed machines, including old ones, which overestimates the cost of buying a new machine to create new capacity.
Dante sells a retail whole genome for $700. Dante and Veritas have both previously run sales at $200 and $300 (which was probably a way of measuring the demand curve and doesn’t say much about cost). Two days after writing this comment, Veritas cut its price in half to $600, below Dante.
I did not know how much work is going on.
Nice post! The plot from NHGRI looks strangely off in my browser. The plateau at the bottom right should be just above $1000 (but looks as if it is closer to $100). The graph in the actual NHGRI page ( https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost ) is correct, and so is Wikipedia.