Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments
Why this mattered
Smyth’s 2004 paper mattered because it turned differential expression analysis from a collection of experiment-specific tests into a general statistical framework for high-throughput biology. Microarray studies often had thousands of genes but only a few biological replicates, making ordinary gene-by-gene variance estimates unstable. By embedding linear models in an empirical Bayes framework, the paper made it possible to “borrow strength” across genes: each gene retained its own model and contrast, but its variance estimate was moderated toward a pooled distribution learned from the full experiment. This changed the practical meaning of small-sample genomics, allowing designed experiments with multiple treatments, contrasts, missing spots, quality weights, and both one- and two-color arrays to be analyzed with a coherent inferential procedure.
The conceptual shift was not merely technical. The moderated t- and F-statistics provided familiar frequentist outputs while using hierarchical Bayesian shrinkage to stabilize inference. That combination helped make rigorous genome-wide testing usable for working biologists, because it fit naturally into experimental design, contrast testing, and multiple-testing workflows. The paper also avoided dependence on a fully specified prior distribution for non-null fold changes, making the method more robust and broadly applicable than earlier posterior-odds formulations.
Its influence extended well beyond microarrays. Implemented in the limma package for R/Bioconductor, the approach became a standard analysis engine for expression studies and helped establish linear modeling plus empirical Bayes shrinkage as a default pattern in genomics. Later RNA-seq methods inherited the same broad lesson: high-dimensional biological assays need models that share information across features while preserving feature-level tests. In that sense, Smyth’s paper helped define the statistical architecture for modern transcriptomics, where variance moderation, dispersion shrinkage, contrast-based designs, and genome-wide error control became expected rather than exceptional.
Abstract
The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lönnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.
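The shrinkage described in the abstract can be illustrated with a minimal sketch for a single gene in a two-group comparison. The posterior variance is taken as a degrees-of-freedom-weighted average of the gene's own residual variance and a prior variance, and the moderated t-statistic is referred to a t-distribution with the prior and residual degrees of freedom added together. The function name `moderated_t`, its arguments, and the numbers in the usage lines are illustrative assumptions; in particular, the prior degrees of freedom and prior variance are supplied by hand here, whereas the paper estimates them from all genes in the experiment. This is not the limma implementation.

```python
from math import sqrt
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_df, prior_var):
    """Sketch of a two-group moderated t-statistic for one gene.

    Assumptions (not from the paper's text): prior_df and prior_var are
    treated as known inputs, and each group has at least two observations.
    """
    n_a, n_b = len(group_a), len(group_b)
    resid_df = n_a + n_b - 2

    # Ordinary pooled residual variance for this gene.
    sample_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / resid_df

    # Shrink the gene-wise variance toward the prior value, weighting
    # each by its degrees of freedom.
    posterior_var = (prior_df * prior_var + resid_df * sample_var) / (
        prior_df + resid_df
    )

    effect = mean(group_b) - mean(group_a)
    unscaled_sd = sqrt(1 / n_a + 1 / n_b)

    return {
        "ordinary_t": effect / (sqrt(sample_var) * unscaled_sd),
        "moderated_t": effect / (sqrt(posterior_var) * unscaled_sd),
        "df": prior_df + resid_df,  # augmented degrees of freedom
        "posterior_var": posterior_var,
    }


# Made-up log-expression values for one gene, two arrays per group.
result = moderated_t([0.0, 0.1], [1.0, 1.1], prior_df=4, prior_var=0.05)
print(result)
```

With only two arrays per group, the gene-wise variance in this example happens to be very small, so the ordinary t-statistic is inflated; pulling the variance toward the prior value gives a less extreme moderated statistic, which is the stabilizing effect the abstract refers to.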
Related
- enables → Differential expression analysis for sequence count data — limma's empirical Bayes shrinkage of gene-wise variance estimates enabled DESeq's analogous moderation of dispersion estimates for differential expression.
- cite ← Differential expression analysis for sequence count data — DESeq cites limma's empirical Bayes differential-expression framework as a precedent while replacing microarray normal models with count-based modeling.