Long Short-Term Memory

Why this mattered

Hochreiter and Schmidhuber’s paper mattered because it turned recurrent neural networks from systems that were suited to sequences in theory but unable to learn long dependencies in practice into systems with an explicit mechanism for preserving information over time. The key shift was not architectural novelty alone but a diagnosis and repair of the vanishing-gradient problem: the LSTM cell’s constant error carousel let error signals persist across many time steps, while learned multiplicative input and output gates controlled when information was written to or read from the cell. (The forget gate, which lets the network learn when to discard stored state, was added later by Gers, Schmidhuber, and Cummins.) This made long-range temporal credit assignment a trainable feature of the model rather than an accident of initialization, task simplicity, or hand-designed state.
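The interplay of the constant error carousel and the gates can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's exact formulation: the parameter layout, the helper names (`init_params`, `memory_cell_step`, `run`), and the use of `tanh` as the squashing function are assumptions made here, and only an input gate and an output gate are modeled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_cells, rng, gate_bias=0.0):
    """Hypothetical parameter layout: one weight matrix and bias per gate
    and one for the cell input. gate_bias sets the input-gate bias."""
    n_z = n_in + n_cells  # gates see the current input and the previous output
    def small_weights():
        return 0.1 * rng.standard_normal((n_cells, n_z))
    return {
        "W_in": small_weights(), "b_in": np.full(n_cells, gate_bias),
        "W_out": small_weights(), "b_out": np.zeros(n_cells),
        "W_cell": small_weights(), "b_cell": np.zeros(n_cells),
    }

def memory_cell_step(x, h_prev, c_prev, p):
    """One time step of a gated memory cell (assumed tanh squashing)."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["W_in"] @ z + p["b_in"])      # input gate: when to write
    o = sigmoid(p["W_out"] @ z + p["b_out"])    # output gate: when to read
    g = np.tanh(p["W_cell"] @ z + p["b_cell"])  # candidate value to write
    c = c_prev + i * g   # carousel: self-connection with fixed weight 1.0
    h = o * np.tanh(c)   # gated read-out of the stored state
    return h, c

def run(xs, p, c0):
    """Feed a sequence xs of shape (T, n_in) through the cell from state c0."""
    h = np.zeros_like(c0)
    c = c0.copy()
    for x in xs:
        h, c = memory_cell_step(x, h, c, p)
    return h, c
```

With a strongly negative input-gate bias (for example `gate_bias=-30.0`), the input gate stays almost closed, so the state passed in as `c0` should survive a long input sequence nearly unchanged. That is the sense in which the cell protects stored information until a gate opens.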

The result was a new practical regime for sequence learning. Before LSTM, recurrent networks trained with backpropagation through time (BPTT) or real-time recurrent learning (RTRL), including Elman networks and related architectures, often failed or learned very slowly on tasks requiring memory over long time lags. LSTM showed that gradient-based learning could solve artificial problems with dependencies spanning hundreds of time steps, and in some cases more than a thousand, including tasks that previous recurrent algorithms had not solved. That demonstration helped establish gated memory as a central design principle: neural networks could learn not only mappings from inputs to outputs, but also when to store, protect, and expose internal state.

Its later importance came from how widely that principle generalized. LSTMs became a foundation for major advances in speech recognition, handwriting recognition, language modeling, machine translation, captioning, and other sequence tasks in the 2000s and 2010s, especially once larger datasets and GPUs made deep sequence models practical. Later architectures, including gated recurrent units and the attention-based Transformers that eventually displaced LSTMs in many settings, addressed the same underlying concern: long-range dependencies require mechanisms that preserve usable signal across distance. The 1997 paper therefore sits at a hinge point between early recurrent-network theory and the modern era of trainable sequence models.

Abstract

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
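The contrast the abstract draws between decaying error backflow and constant error flow can be illustrated for the simplest case: a single unit with one self-connection. In that setting an error signal flowing back one step is scaled by the self-connection weight times the derivative of the activation function. The sketch below multiplies those per-step factors over many steps; the function names and the choice of `tanh` for the conventional unit are assumptions for illustration.

```python
import numpy as np

def tanh_prime(net):
    """Derivative of tanh, always in (0, 1]."""
    return 1.0 - np.tanh(net) ** 2

def identity_prime(net):
    """Derivative of the identity activation, always exactly 1."""
    return np.ones_like(net)

def backflow_factor(w, pre_activations, derivative):
    """Total scaling applied to an error signal sent back through one
    self-connected unit: the product over time of w * f'(net_t)."""
    return float(np.prod(w * derivative(pre_activations)))

rng = np.random.default_rng(0)
nets = 0.5 * rng.standard_normal(1000)  # hypothetical pre-activations, 1000 steps

# Conventional unit: tanh activation, self-weight 0.9 (every factor is below 1).
decaying = backflow_factor(0.9, nets, tanh_prime)

# Carousel-style unit: linear activation, self-weight 1.0 (every factor is 1).
constant = backflow_factor(1.0, nets, identity_prime)
```

Here `constant` is exactly 1.0, while `decaying` is bounded above by 0.9 raised to the 1000th power, which is vanishingly small. A self-weight large enough to push the factors above 1 would instead make the product grow, which is why a linear unit with weight fixed at 1.0 is the stable choice in this single-unit picture.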
