Extended-Connectivity Fingerprints
Why this mattered
ECFP mattered because it changed molecular fingerprints from mostly search-oriented encodings into a practical language for predictive chemistry. Earlier topological fingerprints were often defined around fixed path patterns or substructure-screening needs; Rogers and Hahn formalized a circular, atom-neighborhood-based representation designed for structure-activity modeling. That made a molecule’s local chemical environments available as sparse, computable features without requiring experts to predefine every relevant substructure in advance. The result was a fingerprint that was fast enough for large libraries, expressive enough to capture many medicinal-chemistry motifs, and interpretable enough that model signals could often be traced back to chemically meaningful fragments.
The shift was operational as well as algorithmic. ECFP made it routine to train machine-learning and statistical models directly on large chemical collections using a standardized representation that worked across targets, assays, and compound series. Because the features were generated from molecular connectivity rather than hand-curated descriptors, ECFP helped move cheminformatics toward scalable virtual screening, activity prediction, similarity search tuned for bioactivity, and fragment-level model interpretation. It became one of the default baselines against which later molecular representations were judged.
Its later influence is visible in both classical QSAR and modern molecular machine learning. Many successful ligand-based models, including random forests, support-vector machines, Bayesian models, and deep neural networks, used ECFP-like fingerprints as input because they offered a strong balance of simplicity and predictive power. At the same time, the paper helped set the conceptual stage for graph-based neural methods: modern message-passing networks also build molecular representations by iteratively aggregating local atomic neighborhoods. ECFP can therefore be seen as a crucial bridge between rule-based chemical informatics and learned molecular representation, giving the field a durable, high-performing standard while clarifying what later representation-learning methods needed to improve upon.
Abstract
Extended-connectivity fingerprints (ECFPs) are a novel class of topological fingerprints for molecular characterization. Historically, topological fingerprints were developed for substructure and similarity searching. ECFPs were developed specifically for structure-activity modeling. ECFPs are circular fingerprints with a number of useful qualities: they can be very rapidly calculated; they are not predefined and can represent an essentially infinite number of different molecular features (including stereochemical information); their features represent the presence of particular substructures, allowing easier interpretation of analysis results; and the ECFP algorithm can be tailored to generate different types of circular fingerprints, optimized for different uses. While the use of ECFPs has been widely adopted and validated, a description of their implementation has not previously been presented in the literature.
Related
- cite → Classification and Regression Trees. — Extended-connectivity fingerprints cites CART as a decision-tree learning method used for molecular classification and regression tasks.
- enables ← Classification and Regression Trees. — CART popularized tree-based decision rules, the kind of interpretable branching structure later used to reason about molecular substructures encoded by extended-connectivity fingerprints.