Gradient-based learning applied to document recognition
Why this mattered
LeCun, Bottou, Bengio, and Haffner’s 1998 paper mattered because it gave a concrete, large-scale demonstration that learned representations could replace much of the hand-engineered preprocessing then dominant in pattern recognition. Its central claim was not merely that neural networks could classify digits, but that convolutional architectures, trained end-to-end by gradient descent, were especially well suited to visual data because they built in translation tolerance, local connectivity, and shared weights. In the context of document recognition, this made high-dimensional image classification practical with minimal manual feature design, and showed CNNs outperforming competing methods on a standard handwritten digit task.
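To make the weight-sharing argument concrete, here is a minimal pure-Python sketch. It is not code from the paper. It assumes a 32x32 input, 5x5 kernels, and 6 feature maps (sizes chosen only to resemble a small first convolutional layer), and the helper names `conv2d_valid`, `shift_right`, `conv_params`, and `dense_params` are hypothetical. It illustrates two points: a shared kernel needs far fewer parameters than a fully connected layer producing the same number of outputs, and shifting the input shifts the feature map by the same amount.

```python
def conv2d_valid(image, kernel, bias=0):
    """Slide one shared kernel over the image ('valid' mode, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            bias + sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def shift_right(image, k):
    """Shift every row k pixels to the right, padding with zeros on the left."""
    return [[0] * k + row[:len(row) - k] for row in image]


def conv_params(kernel_size, n_maps):
    """Trainable parameters when each feature map shares one kernel and one bias."""
    return n_maps * (kernel_size * kernel_size + 1)


def dense_params(in_size, out_size, n_maps):
    """Parameters if every output unit had its own weight per input pixel plus a bias."""
    n_inputs = in_size * in_size
    n_outputs = n_maps * out_size * out_size
    return n_outputs * (n_inputs + 1)


# Toy 6x6 image with a small blob, and an illustrative 3x3 edge-detecting kernel.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 3, 4, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

feature_map = conv2d_valid(image, kernel)
shifted_map = conv2d_valid(shift_right(image, 1), kernel)

# Away from the left border, the shifted input gives the same responses
# one column further to the right.
equivariant = all(
    row_s[1:] == row_f[:-1] for row_s, row_f in zip(shifted_map, feature_map)
)

print(equivariant)              # True
print(conv_params(5, 6))        # 156
print(dense_params(32, 28, 6))  # 4821600
```

The dense layer here is a hypothetical point of comparison, not a baseline reported in the paper. The gap between the two parameter counts is what makes the shared-weight design attractive for high-dimensional image inputs.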
The paper also broadened the meaning of “gradient-based learning” beyond isolated classifiers. Through graph transformer networks, it argued that full document-processing pipelines, including segmentation, recognition, and language constraints, could be treated as trainable systems optimized against a single global objective. That was a major conceptual shift: instead of tuning modules separately and hoping their errors composed well, the system could be trained to improve the final task-level result. The cheque-reading system described in the paper gave this argument unusual force because it was not only experimental: according to the abstract, it was deployed commercially and read several million cheques per day.
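The idea of training composed modules against one task-level objective can be illustrated with a deliberately tiny sketch. This is not a graph transformer network and not the paper's system. It assumes two scalar linear modules chained together, a squared-error loss on the final output, and hypothetical names (`forward`, `global_loss`, `gradients`, `train`). It only shows how the chain rule carries the final error back to the parameters of an earlier module, so that both modules are updated to reduce the same loss.

```python
def forward(x, w1, w2):
    """Two chained toy modules: module 1 feeds module 2."""
    h = w1 * x  # output of module 1 (stand-in for an early pipeline stage)
    y = w2 * h  # output of module 2 (final system output)
    return h, y


def global_loss(x, target, w1, w2):
    """Squared error measured only on the final output of the whole chain."""
    _, y = forward(x, w1, w2)
    return (y - target) ** 2


def gradients(x, target, w1, w2):
    """Backpropagate the final loss through both modules with the chain rule."""
    h, y = forward(x, w1, w2)
    dloss_dy = 2.0 * (y - target)
    grad_w2 = dloss_dy * h    # dy/dw2 = h
    dloss_dh = dloss_dy * w2  # dy/dh = w2
    grad_w1 = dloss_dh * x    # dh/dw1 = x
    return grad_w1, grad_w2


def train(x, target, w1, w2, lr=0.05, steps=200):
    """Plain gradient descent on both modules' parameters at once."""
    for _ in range(steps):
        g1, g2 = gradients(x, target, w1, w2)
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2


# Illustrative values only: one input, one target, arbitrary starting weights.
x, target = 1.0, 2.0
w1_start, w2_start = 0.5, 0.5

loss_before = global_loss(x, target, w1_start, w2_start)
w1_end, w2_end = train(x, target, w1_start, w2_start)
loss_after = global_loss(x, target, w1_end, w2_end)

print(loss_before, loss_after)
```

Neither module has a target of its own in this sketch. The only training signal is the error at the end of the chain, which is the sense in which the objective is global.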
Its later importance comes from how clearly it anticipated the deep-learning revival of the 2010s. The paper did not have today’s data scale, GPUs, or very deep architectures, but it established several principles that became central to later breakthroughs in vision, speech, and sequence modeling: learned features, convolutional inductive bias, backpropagation through composed systems, and optimization of end-to-end performance rather than intermediate hand-designed objectives. In retrospect, it stands as one of the clearest pre-ImageNet demonstrations that neural networks could be engineered into reliable industrial systems and that gradient-based learning was a general paradigm for perception, not a specialized trick for toy recognition tasks.
Abstract
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
Related
- cite → A training algorithm for optimal margin classifiers — LeCun's document-recognition work contrasts convolutional neural networks with optimal-margin classifiers as competing supervised methods for handwritten digit recognition.
- cite → Approximation by superpositions of a sigmoidal function — The CNN document-recognition paper cites the sigmoidal universal approximation theorem to justify neural networks' capacity to approximate complex decision functions.
- cite → Receptive fields, binocular interaction and functional architecture in the cat's visual cortex — Convolutional networks borrow the local receptive-field and hierarchical visual-processing concepts established by Hubel and Wiesel's cat visual cortex experiments.
- cite → Backpropagation Applied to Handwritten Zip Code Recognition — LeCun's 1998 document-recognition system extends earlier backpropagation-based handwritten ZIP-code recognition into a larger convolutional network and graph-transformer pipeline.
- enables → Going deeper with convolutions — LeCun's convolutional document-recognition system enables GoogLeNet by demonstrating end-to-end gradient-trained convolutional networks for visual recognition.
- enables → Human-level control through deep reinforcement learning — LeNet's convolutional feature learning provided the visual representation method that DQN used to process Atari screen pixels.
- enables → Image Super-Resolution Using Deep Convolutional Networks — LeNet's convolutional weight sharing and end-to-end gradient training provided the CNN template adapted by SRCNN for super-resolution.
- enables → Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — LeCun et al. showed that convolutional networks can learn hierarchical visual features end-to-end, which R-CNN reused by applying a deep CNN feature extractor to region proposals.
- enables → Deep Learning with Differential Privacy — LeCun et al.'s gradient-based neural-network training is the optimization setting that Deep Learning with Differential Privacy adapts by making SGD differentially private.
- enables → Convolutional Neural Networks for Sentence Classification — LeCun's document-recognition CNN demonstrated convolution and pooling for pattern extraction, which sentence CNNs adapted from images to word-sequence classification.
- enables → A Fast Learning Algorithm for Deep Belief Nets — LeCun's backpropagation-trained neural networks enabled deep belief nets by demonstrating that multilayer representations could be learned effectively for perception tasks.
- cite ← Going deeper with convolutions — GoogLeNet extends the convolutional neural network paradigm established by LeNet for gradient-based visual recognition.
- cite ← Human-level control through deep reinforcement learning — The DQN paper cites LeNet to ground its use of convolutional neural networks for learning visual features directly from raw images.
- cite ← Image Super-Resolution Using Deep Convolutional Networks — SRCNN adapts the convolutional-network learning paradigm established by LeNet-style document recognition to low-level image super-resolution.
- cite ← Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — R-CNN builds on the convolutional neural network architecture popularized by LeNet for visual recognition tasks.
- cite ← Deep Learning with Differential Privacy — Deep Learning with Differential Privacy applies private stochastic gradient training to convolutional neural networks descended from LeNet.
- cite ← Convolutional Neural Networks for Sentence Classification — Kim adapts the convolution-and-pooling architecture popularized by LeCun et al. for document recognition to sentence-level text classification.
- cite ← A Fast Learning Algorithm for Deep Belief Nets — Hinton et al. contrast deep belief net pretraining with the supervised gradient-based convolutional learning demonstrated by LeCun et al. for document recognition.
- enables ← A training algorithm for optimal margin classifiers — Optimal-margin classifiers supplied one of the competing supervised methods against which LeNet was compared on handwritten digit recognition.
- enables ← Approximation by superpositions of a sigmoidal function — Universal approximation by sigmoidal networks justified multilayer neural networks as expressive function approximators for LeNet-style recognition.
- enables ← Receptive fields, binocular interaction and functional architecture in the cat's visual cortex — Hubel and Wiesel's visual-cortex receptive fields inspired convolutional networks' local receptive fields and hierarchical feature maps.
- enables ← Backpropagation Applied to Handwritten Zip Code Recognition — Backpropagation for handwritten zip-code recognition directly preceded LeNet's gradient-trained convolutional architecture for document recognition.
Sources
- DOI: https://doi.org/10.1109/5.726791
- OpenAlex: https://openalex.org/W2112796928