
Accelerated Profile HMM Searches

Why this mattered

Before this work, profile HMMs were already one of the most principled ways to detect remote sequence homology: they modeled position-specific conservation, insertions, deletions, and uncertainty more naturally than pairwise alignment heuristics. The limitation was practical rather than conceptual. Sensitive profile-HMM searches were often too slow for routine use at the scale of growing protein databases, so many workflows still depended on faster but less expressive tools such as BLAST. Eddy’s 2011 paper mattered because it changed that tradeoff. By introducing the MSV filter and sparse rescaling, HMMER3 made probabilistic profile-HMM search fast enough to become an everyday database-search instrument rather than a specialist method reserved for smaller or slower analyses.

The paradigm shift was not merely “faster HMMER.” The paper showed that a carefully designed heuristic pipeline could preserve nearly all of the sensitivity of full profile-HMM inference while rejecting most database sequences cheaply. MSV supplied a statistically interpretable, vectorized first-pass filter; promising hits then flowed into more exact Forward/Backward analysis. This made it possible to search large protein databases with profile HMMs at roughly BLAST-like speeds while retaining the advantages of profile-based probabilistic modeling. In practice, that enabled broader and more systematic annotation of protein families, domains, and remote homologs, especially through resources and workflows built around HMMER and profile-HMM libraries such as Pfam.
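The filter-then-refine idea can be illustrated with a minimal sketch. It assumes the cheap filter score follows a Gumbel (extreme value) distribution, the standard form for optimal local alignment scores, so that a score converts directly into a P-value. The function names `gumbel_pvalue` and `filter_pipeline`, the `cheap_score` and `full_score` callables, the parameters `mu` and `lam`, and the 0.02 threshold are all hypothetical choices made for illustration; this is not HMMER3's code or API.

```python
import math

def gumbel_pvalue(score, mu, lam):
    """P(S >= score) for a Gumbel distribution with location mu and slope lam.

    Written with expm1 so that very small P-values keep their precision.
    """
    return -math.expm1(-math.exp(-lam * (score - mu)))

def filter_pipeline(targets, cheap_score, full_score, mu, lam, p_threshold=0.02):
    """Score every target cheaply; run the expensive scorer only on survivors.

    targets     : iterable of (name, sequence) pairs
    cheap_score : fast heuristic scorer (stands in for a filter such as MSV)
    full_score  : slow, more exact scorer (stands in for Forward/Backward)
    mu, lam     : assumed Gumbel parameters of the cheap score on random sequences
    """
    hits = []
    for name, seq in targets:
        p = gumbel_pvalue(cheap_score(seq), mu, lam)
        if p <= p_threshold:          # only promising targets pay the full cost
            hits.append((name, full_score(seq)))
    return hits
```

Because the threshold is expressed as a P-value, it directly bounds the fraction of unrelated sequences expected to reach the expensive stage, which is what makes a statistically calibrated filter easy to tune.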

Its later importance lies in how it helped normalize a pattern now common in computational biology: use a fast, statistically calibrated filter to make a richer probabilistic or model-based method usable at scale. HMMER3 became infrastructure for genome annotation, metagenomics, protein-family curation, and comparative genomics, where millions to billions of sequence comparisons are routine. Subsequent breakthroughs in large-scale protein analysis, from massive reference databases to modern structure and function prediction pipelines, depended on reliable ways to place sequences into evolutionary families. This paper helped make that family-level search both sensitive and computationally routine.

Abstract

Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call "sparse rescaling". These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
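To make "an optimal sum of multiple ungapped local alignment segments" concrete, here is a simplified sketch in plain Python. It is not the MSV algorithm as implemented in HMMER3: it replaces the model's special-state transition costs with a single assumed per-segment penalty, uses a toy log-odds profile, and runs as an ordinary loop rather than a striped vector-parallel computation. The name `msv_like_score` and the parameters `segment_penalty` and `default` are hypothetical.

```python
def msv_like_score(profile, seq, segment_penalty=2.0, default=-1.0):
    """Best total score of non-overlapping ungapped (diagonal) segments.

    profile         : list of dicts, one per model position, mapping
                      residue -> log-odds score (toy values, an assumption)
    seq             : target sequence as a string
    segment_penalty : assumed fixed cost charged for opening each segment
    default         : assumed score for residues missing from a position's dict
    """
    m = len(profile)
    neg_inf = float("-inf")
    prev = [neg_inf] * (m + 1)  # prev[k]: best path whose last segment ends at (i-1, k)
    best = 0.0                  # best total over the prefix so far (zero segments = 0)
    for x in seq:
        start = best - segment_penalty       # open a new segment at this residue
        cur = [neg_inf] * (m + 1)
        for k in range(1, m + 1):
            emit = profile[k - 1].get(x, default)
            # Either extend along the diagonal (no gaps) or begin a new segment.
            cur[k] = emit + max(prev[k - 1], start)
        best = max(best, max(cur[1:], default=neg_inf))
        prev = cur
    return best

# Toy three-position profile: each position rewards one residue with +3.
toy_profile = [{"A": 3.0}, {"C": 3.0}, {"D": 3.0}]
print(msv_like_score(toy_profile, "ACD"))       # one segment:  9 - 2 = 7.0
print(msv_like_score(toy_profile, "ACDWWACD"))  # two segments: 7 + 7 = 14.0
```

Since each cell depends only on its diagonal predecessor and one shared start value, the inner loop has no dependencies within a row, which is the property that makes this kind of recursion well suited to vector-parallel evaluation.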
