Skip to content

Maximum Likelihood from Incomplete Data Via the EM Algorithm

Why this mattered

Before Dempster, Laird, and Rubin, many likelihood problems with missing, censored, latent, grouped, or otherwise incomplete data were treated as separate computational tricks. This paper unified them under a single principle: alternate between estimating the expected complete-data log likelihood and maximizing it. The key shift was not merely a new optimization routine, but a way to make latent-variable likelihoods operational. Problems that had seemed analytically awkward or computationally fragmented could now be expressed as instances of one general algorithm with a monotone likelihood property.

That unification made maximum-likelihood estimation practical across fields where the quantities of interest are only indirectly observed. Finite mixture models, factor analysis, variance components, censored-data models, missing-data analysis, and empirical Bayes or hyperparameter estimation all gained a common computational language. The paper showed that “incomplete data” was not a special nuisance case but a broad structural pattern: if a simpler complete-data model could be imagined, estimation could often proceed by iteratively filling in its sufficient information in expectation.

Its influence is visible in much of later statistical machine learning. EM became a standard engine for mixture modeling, hidden Markov models, probabilistic clustering, latent-class analysis, and many graphical-model methods, and it helped normalize the idea that inference and optimization could be interleaved around unobserved variables. Later variational inference, expectation propagation, and modern latent-variable learning methods do not simply replace EM, but they inherit its central viewpoint: difficult observed-data likelihoods can often be attacked by introducing latent structure and optimizing a tractable surrogate.

Abstract

Summary A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.

Sources