A map of human genome variation from population-scale sequencing
Why this mattered
TBD
Abstract
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
This issue of Nature contains the first publication from the 1000 Genomes Project, an international collaboration that will produce an extensive public catalogue of human genetic variation. The plan is to sequence about 2,000 unidentified individuals from 20 populations around the world. This first paper presents the results from the project's pilot phase, which tested three strategies for genome-wide sequencing with high-throughput platforms: low-coverage whole-genome sequencing of 179 individuals from four populations, high-coverage sequencing of two mother–father–child trios, and exon-targeted sequencing of 697 individuals from seven populations. The goal of the 1000 Genomes Project is to provide in-depth information on variation in human genome sequences, and the pilot phase reported here developed and compared these sequencing strategies. The resulting data set includes more than 95% of the currently accessible variants found in any individual, and can be used to inform association and functional studies.
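To give a sense of scale for the de novo mutation rate quoted in the abstract, the following is a minimal back-of-envelope sketch, not the paper's estimation method. It multiplies the approximate rate by a genome size and by two for the diploid genome. The genome size of 3.1 × 10⁹ base pairs and the function name `expected_de_novo` are assumptions introduced here for illustration. The abstract's estimate applies to the accessible portion of the genome, so the result is only an order-of-magnitude figure.

```python
# Back-of-envelope arithmetic: expected de novo base substitutions per child.
# This is an illustration of scale, not the trio-based method used in the paper.

MUTATION_RATE = 1e-8        # per base pair per generation (approximate figure from the abstract)
HAPLOID_GENOME_BP = 3.1e9   # ASSUMPTION: rough haploid human genome size, not stated in the abstract


def expected_de_novo(rate: float, haploid_bp: float, ploidy: int = 2) -> float:
    """Expected number of new substitutions across both inherited genome copies."""
    return rate * haploid_bp * ploidy


estimate = expected_de_novo(MUTATION_RATE, HAPLOID_GENOME_BP)
print(f"about {estimate:.0f} expected de novo substitutions per child")
```

Under these assumptions the expectation comes out on the order of 60 new substitutions per child, which suggests why such events must be separated from a much larger number of sequencing errors when calling them directly from trio data.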
Related
- cite → The Sequence Alignment/Map format and SAMtools — The 1000 Genomes Project cites Li et al. because SAMtools and the SAM/BAM format provided core infrastructure for aligning and calling variants from population-scale sequencing data.
- enables → A global reference for human genetic variation — The 2010 1000 Genomes pilot established population-scale sequencing and variant-catalog methods that enabled the 2015 global reference haplotype map.
- enables → The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans — The 1000 Genomes variation map enabled GTEx to use population-scale human variant references for genotype imputation and variant interpretation.
- cite ← An integrated encyclopedia of DNA elements in the human genome — ENCODE uses population-scale human variation maps as genomic context for interpreting regulatory elements across the human genome.
- cite ← A framework for variation discovery and genotyping using next-generation DNA sequencing data — The GATK framework was motivated and validated by population-scale variant discovery efforts such as the 1000 Genomes human variation map.
- cite ← A global reference for human genetic variation — The 2015 global reference extends the 2010 1000 Genomes pilot map from population-scale sequencing into a larger catalog of human variation.
- cite ← An integrated map of genetic variation from 1,092 human genomes — The 2012 1000 Genomes integrated map extends the 2010 pilot map from population-scale sequencing to a larger, more comprehensive catalog of human genetic variants.
- cite ← The variant call format and VCFtools — The VCF paper cites the 1000 Genomes pilot because population-scale sequencing motivated a standard format for representing human genetic variants.
- cite ← The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans — The GTEx pilot uses population-scale human variation maps as reference context for interpreting genetic variants.
- cite ← Biological insights from 108 schizophrenia-associated genetic loci — The 1000 Genomes variation map supplies reference haplotypes and linkage disequilibrium structure used to interpret schizophrenia-associated loci.