Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring
Why this mattered
Golub et al. made a decisive case that cancer could be classified by a tumor’s global molecular state rather than only by morphology, immunophenotype, cytogenetics, or a small number of known markers. The acute leukemia test case was well chosen: AML and ALL were already recognized as clinically distinct diseases, so the study could show that unsupervised analysis of gene-expression patterns rediscovered a known distinction without being given the labels. That turned DNA microarrays from a descriptive genomic technology into a practical framework for class discovery, that is, finding disease subtypes from high-dimensional molecular measurements.
The paper also established that expression profiles could support class prediction, assigning new tumors to diagnostic categories using learned gene-expression signatures. This was a paradigm shift because it made cancer classification a computational and genome-scale problem: instead of asking whether a single suspected marker was present, researchers could ask whether the entire transcriptional program of a tumor matched a known biological class or suggested a new one. After this, it became newly plausible to define clinically meaningful tumor subtypes, prognostic signatures, and treatment-relevant categories from data rather than from prior pathology alone.
Its influence is visible in later molecular taxonomies of breast cancer, lymphoma, glioblastoma, and many other cancers, as well as in The Cancer Genome Atlas and modern precision oncology. The paper did not by itself solve clinical translation, reproducibility, or treatment selection, but it supplied one of the central conceptual templates: cancers are heterogeneous molecular systems, and measuring that heterogeneity systematically can reveal diagnosis, lineage, prognosis, and therapeutic vulnerabilities. In that sense, it helped move oncology toward the genomic era, where classification increasingly depends on molecular profiles as much as on the microscope.
Abstract
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
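To make the idea of an "automatically derived class predictor" concrete, the following is a minimal sketch of a signal-to-noise weighted-voting classifier, in the spirit of the approach associated with this paper. The abstract itself does not specify the procedure, so treat the details as assumptions: the data are synthetic, the gene counts are arbitrary, and every name (`signal_to_noise`, `train_predictor`, `predict`, `synthetic_sample`) is hypothetical.

```python
import random
import statistics


def signal_to_noise(class1, class2):
    """Score one gene: (mean1 - mean2) / (sd1 + sd2); 0.0 if both sds are zero."""
    mu1, mu2 = statistics.mean(class1), statistics.mean(class2)
    denom = statistics.stdev(class1) + statistics.stdev(class2)
    return (mu1 - mu2) / denom if denom else 0.0


def train_predictor(samples, labels, n_genes):
    """Keep the n_genes with the largest |score|, with weight and decision boundary."""
    scored = []
    for g in range(len(samples[0])):
        aml = [s[g] for s, y in zip(samples, labels) if y == "AML"]
        all_ = [s[g] for s, y in zip(samples, labels) if y == "ALL"]
        weight = signal_to_noise(aml, all_)  # positive means higher in AML
        boundary = (statistics.mean(aml) + statistics.mean(all_)) / 2
        scored.append((abs(weight), g, weight, boundary))
    scored.sort(reverse=True)
    return [(g, w, b) for _, g, w, b in scored[:n_genes]]


def predict(model, sample):
    """Sum per-gene votes; return the winning label and a prediction strength in [0, 1]."""
    v_aml = v_all = 0.0
    for g, weight, boundary in model:
        vote = weight * (sample[g] - boundary)
        if vote > 0:
            v_aml += vote
        else:
            v_all += -vote
    total = v_aml + v_all
    strength = abs(v_aml - v_all) / total if total else 0.0
    return ("AML" if v_aml > v_all else "ALL"), strength


# Synthetic data (an assumption, not the study's data): 20 genes, of which
# the first 5 are shifted up in "AML" samples and down in "ALL" samples.
rng = random.Random(0)
N_GENES, N_INFORMATIVE = 20, 5


def synthetic_sample(label):
    shift = 2.0 if label == "AML" else -2.0
    return [rng.gauss(shift if g < N_INFORMATIVE else 0.0, 1.0)
            for g in range(N_GENES)]


train_labels = ["AML"] * 10 + ["ALL"] * 10
train_samples = [synthetic_sample(y) for y in train_labels]
model = train_predictor(train_samples, train_labels, n_genes=N_INFORMATIVE)

# Two idealized "new cases" with a clear AML-like or ALL-like profile.
aml_like = [2.0] * N_INFORMATIVE + [0.0] * (N_GENES - N_INFORMATIVE)
all_like = [-2.0] * N_INFORMATIVE + [0.0] * (N_GENES - N_INFORMATIVE)
print(predict(model, aml_like))
print(predict(model, all_like))
```

Each selected gene casts a vote proportional to how far the new sample's expression lies from the midpoint between the two class means, weighted by how well that gene separates the classes in training. The prediction strength gives a simple margin that could be thresholded to leave low-confidence cases unassigned; any such threshold would be a further assumption.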
Related
- enables → Why Most Published Research Findings Are False — Gene-expression cancer classification exemplified high-dimensional biomedical discovery with many tested associations, a setting central to Ioannidis's false-findings argument.
- enables → Regularization and Variable Selection Via the Elastic Net — Golub's gene-expression classification problem exemplified the high-dimensional correlated-predictor setting that elastic net regularization was designed to handle.
- cite ← Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications — The breast carcinoma study applies gene-expression-based class discovery and prediction to distinguish clinically relevant tumor subclasses.
- cite ← Why Most Published Research Findings Are False — Ioannidis cites gene-expression cancer classification as a high-dimensional biomedical discovery setting vulnerable to false published findings.
- cite ← Molecular portraits of human breast tumours — Molecular portraits extends Golub et al.'s gene-expression-based cancer class discovery and prediction to breast-tumor subtypes.
- cite ← Regularization and Variable Selection Via the Elastic Net — The elastic net paper cites gene-expression cancer classification as a high-dimensional, correlated-predictor setting where regularized variable selection is useful.