Speech recognition with deep recurrent neural networks

Why this mattered

Graves, Mohamed, and Hinton showed that recurrent networks could be competitive on a central speech-recognition benchmark when depth, LSTM memory, end-to-end sequence training, and regularization were combined in one system. At the time, deep feedforward acoustic models had recently revived neural-network speech recognition, but they still relied on fixed context windows and on frame-level training targets taken from a precomputed alignment. This paper demonstrated that deep LSTM networks trained end-to-end, with methods such as Connectionist Temporal Classification (CTC), could use variable-length temporal context and learn without a pre-specified input-output alignment, reaching a reported 17.7% error on the TIMIT phoneme recognition test set.
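To make "learning without a pre-specified alignment" concrete, the following is a minimal sketch of the quantity CTC assigns to a label sequence: the total probability of every frame-level path that collapses to it. It is an illustration under stated assumptions, not the paper's implementation. The blank index, the function names, and the toy per-frame distributions are all hypothetical, and the sketch works in plain probabilities rather than the log domain a real system would use.

```python
from itertools import product

BLANK = 0  # assumption of this sketch: index 0 is the CTC blank symbol


def collapse(path):
    """Map a frame-level path to labels: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def ctc_prob_bruteforce(probs, target):
    """Sum the probability of every path that collapses to `target` (exponential cost)."""
    T, K = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(K), repeat=T):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


def ctc_prob_forward(probs, target):
    """Same quantity via the forward recursion over the blank-extended target."""
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]  # blank, l1, blank, l2, ..., blank
    T, S = len(probs), len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for i in range(S):
            a = alpha[i]  # stay on the same symbol
            if i >= 1:
                a += alpha[i - 1]  # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if i >= 2 and ext[i] != BLANK and ext[i] != ext[i - 2]:
                a += alpha[i - 2]
            new[i] = a * probs[t][ext[i]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


# Toy per-frame distributions over {blank, label 1, label 2}; values are made up.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
p_forward = ctc_prob_forward(probs, [1, 2])
p_brute = ctc_prob_bruteforce(probs, [1, 2])
```

Training maximizes this summed probability, so no single alignment between frames and labels has to be supplied in advance. The brute-force version is there only to check the recursion on tiny inputs.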

The shift was not merely a better benchmark number. It made plausible a different design pattern for speech systems: replace carefully engineered alignment pipelines and limited-context classifiers with trainable sequence models that directly map acoustic sequences to label sequences. That helped establish recurrent neural networks, especially LSTMs, as serious acoustic models rather than specialist tools for handwriting or small-scale sequence tasks.

Its influence is visible in the next wave of speech and sequence-learning breakthroughs. End-to-end automatic speech recognition systems, encoder-decoder models with attention, RNN Transducer systems, and later large-scale neural audio models all built on the same premise that temporal structure and alignment could be learned inside the model. Although transformers later displaced recurrent networks in many settings, the paper was part of the transition from hybrid speech recognition, which depends on externally computed alignments, toward fully neural sequence-to-sequence learning.

Abstract

Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
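The 17.7% figure is a phoneme error rate. This page does not spell out how it is computed, so the sketch below assumes the usual definition: the edit distance between the recognized and reference phoneme sequences, summed over utterances and divided by the total reference length. The function names and the example sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference symbol
                cur[j - 1] + 1,          # insertion of a hypothesis symbol
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference length (assumed definition)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Made-up reference and hypothesis phoneme sequences for two utterances.
refs = [["sil", "k", "ae", "t"], ["d", "ao"]]
hyps = [["sil", "k", "t"], ["d", "aa"]]
per = phoneme_error_rate(refs, hyps)
```

Because the metric compares whole label sequences rather than per-frame labels, it can score a system trained without any frame-level alignment.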
