
A framework for variation discovery and genotyping using next-generation DNA sequencing data

Why this mattered

DePristo et al. mattered because it turned next-generation sequencing variant discovery from a collection of ad hoc read-filtering and SNP-calling steps into a general, reproducible statistical framework. The paper presented a unified pipeline, implemented in the Genome Analysis Toolkit (GATK), that starts from mapped reads and proceeds through local realignment around indels, base quality score recalibration, SNP discovery and genotyping across multiple samples simultaneously, and a machine-learning step that separates true variants from sequencing artifacts. Its central shift was to treat variant calling as an integrated inference problem over sequencing evidence, not simply as counting mismatches against a reference genome.
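To make the contrast with mismatch counting concrete, here is a minimal Python sketch of quality-aware genotype inference at a single biallelic site. It is an illustration under simplifying assumptions, not the GATK implementation: the function names, the symmetric error model, and the prior values are all hypothetical choices made for this example.

```python
import math

# Illustrative priors for 0, 1, or 2 alternate alleles (assumed values,
# not taken from the paper).
ILLUSTRATIVE_PRIORS = (0.998, 0.001, 0.001)


def genotype_log_likelihoods(pileup):
    """Return log10 likelihoods for diploid genotypes with 0, 1, 2 alt alleles.

    pileup: list of (is_alt, phred_quality) pairs, one per read base at the site.
    Assumes independent reads, a biallelic site, and phred_quality > 0.
    """
    log_liks = []
    for alt_count in (0, 1, 2):
        alt_fraction = alt_count / 2
        total = 0.0
        for is_alt, qual in pileup:
            err = 10 ** (-qual / 10)  # Phred score -> error probability
            # Chance of observing the alt base: a true alt read correctly,
            # or a true ref read by error.
            p_alt_obs = alt_fraction * (1 - err) + (1 - alt_fraction) * err
            total += math.log10(p_alt_obs if is_alt else 1 - p_alt_obs)
        log_liks.append(total)
    return log_liks


def genotype_posteriors(log_liks, priors):
    """Combine log10 likelihoods with priors and normalize to probabilities."""
    log_post = [ll + math.log10(p) for ll, p in zip(log_liks, priors)]
    peak = max(log_post)  # subtract the maximum to avoid underflow
    weights = [10 ** (lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def call_genotype(pileup, priors=ILLUSTRATIVE_PRIORS):
    """Return the most probable number of alt alleles (0, 1, or 2)."""
    post = genotype_posteriors(genotype_log_likelihoods(pileup), priors)
    return max(range(3), key=lambda g: post[g])


# Same allele counts (3 alt, 7 ref), different base qualities on the alt reads.
confident_alt = [(True, 30)] * 3 + [(False, 30)] * 7
doubtful_alt = [(True, 5)] * 3 + [(False, 30)] * 7
print(call_genotype(confident_alt), call_genotype(doubtful_alt))
```

A pure counting rule would treat the two pileups identically, since both show three mismatching bases out of ten. In this sketch the high-quality mismatches support a heterozygous call, while the low-quality ones are better explained as sequencing error, which is why accurate base qualities matter to everything downstream.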

This made the results of large-scale human resequencing comparable across studies. After this work, researchers could analyze exomes and whole genomes with standardized pipelines that produced calibrated variant calls suitable for population genetics, Mendelian disease studies, and cancer genomics. The framework was especially important because sequencing output was growing faster than hand-tuned, project-specific analysis could handle; GATK helped make thousands and then hundreds of thousands of genomes analyzable under shared assumptions and quality metrics.

The framework was evaluated in part on 1000 Genomes Project data, and its influence is visible in later genomic resources and clinical workflows that depended on reliable variant catalogs, including large aggregation projects such as ExAC and gnomAD and many rare-disease sequencing programs. By making variant discovery scalable and statistically disciplined, the paper helped establish the computational substrate for modern precision medicine: not the biological discovery of variation itself, but the ability to call, compare, filter, and interpret variation at population scale.

Abstract

(no abstract available)

Sources