Long Short-Term Memory
Why this mattered
Hochreiter and Schmidhuber’s paper mattered because it changed recurrent neural networks from systems that were theoretically suited to sequences but practically unable to learn long dependencies into systems with an explicit mechanism for preserving information over time. The key shift was not merely architectural novelty, but a diagnosis-and-repair of the vanishing-gradient problem: the LSTM cell’s constant error carousel allowed error signals to persist across many time steps, while learned multiplicative gates controlled when information was written, retained, or read. This made long-range temporal credit assignment a trainable feature of the model rather than an accident of initialization, task simplicity, or hand-designed state.
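The write, retain, and read roles described above can be made concrete with a minimal sketch of one scalar memory cell. It follows the 1997 formulation in having an input gate, an output gate, and a self-connection of fixed weight 1.0, with no forget gate. Several things here are assumptions for illustration rather than the paper's setup: the names `memory_cell_step`, `hold_value`, and `DEMO_WEIGHTS` are hypothetical, the weights are hand-picked instead of learned, `tanh` stands in for the paper's squashing functions, and the gates see only the current input rather than recurrent connections as well.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical hand-picked weights (not learned, not from the paper):
# an input of 1.0 opens the input gate, an input of 0.0 closes it.
DEMO_WEIGHTS = {
    "in_x": 10.0, "in_b": -5.0,   # input gate
    "out_x": 0.0, "out_b": 5.0,   # output gate, held open
    "c_x": 2.0,                   # cell input
}


def memory_cell_step(x, state, w):
    """One time step of a single scalar memory cell without a forget gate."""
    in_gate = sigmoid(w["in_x"] * x + w["in_b"])     # controls writing
    out_gate = sigmoid(w["out_x"] * x + w["out_b"])  # controls reading
    candidate = math.tanh(w["c_x"] * x)              # squashed cell input
    # Constant error carousel: the self-connection has fixed weight 1.0,
    # so the old state is carried forward unchanged and new input is added.
    new_state = 1.0 * state + in_gate * candidate
    output = out_gate * math.tanh(new_state)
    return new_state, output


def hold_value(steps):
    """Write once with x=1.0, then feed x=0.0 for `steps` further steps."""
    written, _ = memory_cell_step(1.0, 0.0, DEMO_WEIGHTS)
    state = written
    for _ in range(steps):
        state, _ = memory_cell_step(0.0, state, DEMO_WEIGHTS)
    return written, state


if __name__ == "__main__":
    written, retained = hold_value(1000)
    print(written, retained)
```

In this toy setting the stored value survives 1000 steps unchanged because a zero input contributes nothing to the additive update. In a trained network the gates would have to learn when to stay closed, which is the part the sketch leaves out.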
The result was a new practical regime for sequence learning. Before LSTM, recurrent approaches such as backpropagation through time (BPTT), real-time recurrent learning (RTRL), and Elman networks often failed or learned very slowly on tasks requiring memory over long time lags. LSTM showed that gradient-based learning could solve artificial problems with minimal time lags of hundreds of steps, and in some experiments more than 1000, including tasks that previous recurrent algorithms had not solved. That demonstration helped establish gated memory as a central design principle: neural networks could learn not only mappings from inputs to outputs, but also when to store, protect, and expose internal state.
Its later importance came from how widely that principle generalized. LSTMs became a foundation for major advances in speech recognition, handwriting recognition, language modeling, machine translation, captioning, and other sequence tasks in the 2000s and 2010s, especially once larger datasets and GPUs made deep sequence models more practical. Later architectures that simplified or displaced LSTMs in many settings, including gated recurrent units and eventually attention-based Transformers, inherited the same underlying concern: long-range dependencies require mechanisms that preserve usable signal across distance. The 1997 paper therefore sits at a hinge point between early recurrent-network theory and the modern era of trainable sequence models.
Abstract
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
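The contrast between "decaying error backflow" and "constant error flow" can be illustrated with a short numerical sketch. It reduces the analysis the abstract refers to down to a single self-connected unit and assumes the per-step factor (self-weight times activation slope) is the same at every step, which a real network would not guarantee. The function name `backflow_scale` and the chosen self-weight of 1.0 for the conventional unit are hypothetical.

```python
def backflow_scale(weight, derivative, steps):
    """Scale applied to an error signal after `steps` steps of backflow
    through one self-connected unit, assuming a constant local factor."""
    return (weight * derivative) ** steps


SIGMOID_MAX_DERIVATIVE = 0.25  # maximum slope of the logistic sigmoid

# Conventional unit: logistic activation at its steepest point,
# self-weight 1.0 (hypothetical value chosen for illustration).
decayed = backflow_scale(1.0, SIGMOID_MAX_DERIVATIVE, 100)

# Constant error carousel: identity activation (slope 1.0), self-weight 1.0.
preserved = backflow_scale(1.0, 1.0, 1000)

if __name__ == "__main__":
    print(decayed, preserved)
```

Even in the logistic unit's best case the error shrinks by a factor of four per step, so after 100 steps it is far too small to drive learning, whereas the carousel's factor of exactly 1.0 leaves the error unchanged over 1000 steps.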
Related
- enables → Show and tell: A neural image caption generator — LSTM supplied the recurrent sequence decoder that Show and Tell used to generate captions from image features.
- enables → Deep Residual Learning for Image Recognition — LSTM's gated additive state path anticipated the identity-like information flow that ResNets used to ease optimization of deep networks.
- enables → TensorFlow: a system for large-scale machine learning — LSTM introduced gated recurrent memory cells for long-range sequence learning, one of the neural architectures TensorFlow was designed to train and deploy at scale.
- enables → Squeeze-and-Excitation Networks — LSTM introduced gated multiplicative control of information flow, a mechanism echoed by squeeze-and-excitation blocks that gate convolutional feature channels.
- enables → Speech recognition with deep recurrent neural networks — LSTM's gated recurrent memory enabled later deep RNN acoustic models to learn long-range temporal structure in speech.
- cite ← Show and tell: A neural image caption generator — Show and Tell uses an LSTM decoder to generate image-caption word sequences from CNN image features.
- cite ← Deep Residual Learning for Image Recognition — ResNet cites LSTM for the gating functions that highway networks apply to their shortcut connections, which it contrasts with its own parameter-free identity shortcuts.
- cite ← TensorFlow: a system for large-scale machine learning — TensorFlow cites LSTM as a recurrent neural network architecture whose training and deployment motivate flexible computation graphs.
- cite ← Squeeze-and-Excitation Networks — Squeeze-and-Excitation Networks relate to LSTM through the shared gating mechanism that adaptively modulates information flow.
- cite ← Speech recognition with deep recurrent neural networks — The speech-recognition paper uses LSTM recurrent units to model long-range temporal dependencies in acoustic sequences.