Inference of Population Structure Using Multilocus Genotype Data
Why this mattered
Before this paper, population structure was often treated as a nuisance to be controlled or a pattern to be summarized with distance-based methods. Pritchard, Stephens, and Donnelly made it directly inferable from multilocus genotype data through an explicit probabilistic model: individuals could be assigned to latent populations, and, crucially, could be represented as admixed rather than forced into a single category. That shifted the field from describing genetic similarity after the fact to estimating ancestry components, migration, and hybridization within a coherent statistical framework.
The practical effect was broad. The method made it possible to detect cryptic structure in samples, identify migrants and hybrids, and correct for population stratification in association studies, all using genetic markers that were becoming increasingly available at the time. Its assumptions were deliberately general: populations were characterized by allele frequencies, markers did not need a specific mutation model, and the number of populations could be explored rather than fixed by geography or prior labels. The resulting software, STRUCTURE, became a standard tool because it turned a difficult inferential problem into something many empirical geneticists could apply to real datasets.
Its influence also extended beyond the original microsatellite-era setting. Later work in genome-wide association studies, human population history, conservation genomics, ancient DNA, and admixture mapping built on the same conceptual move: ancestry could be modeled as latent, probabilistic, and locally or globally mixed. Subsequent tools scaled the idea to dense SNP data and much larger cohorts, but the paradigm introduced here remained central: genetic populations are not merely labels attached to samples, but statistical structure that can be inferred from multilocus variation.
Abstract
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci, e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
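To make the probabilistic assignment in the abstract concrete, here is a minimal Python sketch of one ingredient only: the posterior probability that a single non-admixed individual comes from each of K populations when the allele frequencies are treated as known. It is not the STRUCTURE implementation, which estimates allele frequencies and assignments jointly and also handles admixture. The function name `assignment_posterior`, the uniform prior over populations, the toy frequencies, and the assumptions of unlinked loci with alleles drawn independently from the population's frequencies are all illustrative choices made here, not details taken from the text above.

```python
import math

def assignment_posterior(genotype, allele_freqs):
    """Posterior probability of each population for one individual.

    genotype: list of (allele_a, allele_b) pairs, one pair per locus.
    allele_freqs: allele_freqs[k][l] is a dict mapping allele -> frequency
        in population k at locus l (hypothetical, assumed known and > 0).
    Assumes a uniform prior over populations, unlinked loci, and
    independent draws of the two allele copies at each locus.
    """
    log_liks = []
    for pop in allele_freqs:
        ll = 0.0
        for locus, (a, b) in enumerate(genotype):
            # Each allele copy contributes its frequency in this population.
            ll += math.log(pop[locus][a]) + math.log(pop[locus][b])
        log_liks.append(ll)

    # Normalize in log space to avoid underflow with many loci.
    m = max(log_liks)
    weights = [math.exp(ll - m) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]

# Toy data (made up for illustration): 2 populations, 2 loci, alleles "A"/"B".
toy_freqs = [
    [{"A": 0.9, "B": 0.1}, {"A": 0.8, "B": 0.2}],  # population 0
    [{"A": 0.1, "B": 0.9}, {"A": 0.2, "B": 0.8}],  # population 1
]
toy_genotype = [("A", "A"), ("A", "B")]

posterior = assignment_posterior(toy_genotype, toy_freqs)
print(posterior)
```

With these toy numbers the individual is assigned almost entirely to population 0, because its alleles are common there and rare in population 1. In a full analysis the allele frequencies are themselves unknown, so a step like this would sit inside an iterative inference scheme rather than being run once.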
Related
- enables → Principal components analysis corrects for stratification in genome-wide association studies — STRUCTURE modeled hidden population subgroups from multilocus genotypes, motivating PCA as a faster correction for stratification in GWAS.
- enables → PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses — STRUCTURE's multilocus genotype framework for population stratification enabled PLINK's population-based association analysis tools and stratification-aware workflows.
- cite ← Principal components analysis corrects for stratification in genome-wide association studies — The PCA GWAS paper uses population-structure ideas established by STRUCTURE to correct ancestry stratification in association studies.
- cite ← PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses — PLINK cites STRUCTURE as a genotype-based method for inferring population ancestry and detecting stratification.