Gradient-based learning applied to document recognition

Why this mattered

LeCun, Bottou, Bengio, and Haffner’s 1998 paper mattered because it gave a concrete, large-scale demonstration that learned representations could replace much of the hand-engineered preprocessing then dominant in pattern recognition. Its central claim was not merely that neural networks could classify digits, but that convolutional architectures, trained end-to-end by gradient descent, were especially well suited to visual data because they built in translation tolerance, local connectivity, and shared weights. In the context of document recognition, this made high-dimensional image classification practical with minimal manual feature design, and showed CNNs outperforming competing methods on a standard handwritten digit task.
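The three architectural properties named above can be illustrated with a toy convolution. This is a minimal sketch and not the paper's network: the 8x8 input, the 2x2 kernel values, and the helper name `conv2d_valid` are all assumptions made for the example. It shows that each output unit reads only a small patch (local connectivity), that one small kernel is reused at every position (shared weights), and that shifting the input shifts the response map by the same amount, which is what later pooling layers turn into translation tolerance.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local connectivity: each output sees only a kh x kw patch.
            # Shared weights: the same kernel is applied at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical toy input: a 2x2 bright blob on an otherwise blank 8x8 image.
image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0

# Made-up kernel that responds to vertical edges.
kernel = np.array([[1.0, -1.0],
                   [1.0, -1.0]])

# Shift the input one pixel to the right (the wrapped column is all zeros).
shifted = np.roll(image, shift=1, axis=1)

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# The response map moves by the same one-pixel shift as the input.
equivariant = np.allclose(response[:, :-1], response_shifted[:, 1:])

# Weight count: one shared kernel versus a dense layer of the same shape.
shared_weights = kernel.size                  # 4
dense_weights = image.size * response.size    # 64 inputs x 49 outputs

print(equivariant, shared_weights, dense_weights)
```

The weight comparison at the end is the practical point: the convolution produces the same 7x7 output as a fully connected layer would, but with 4 parameters instead of 3136.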

The paper also broadened the meaning of “gradient-based learning” beyond isolated classifiers. Through graph transformer networks, it argued that full document-processing pipelines, including segmentation, recognition, and language constraints, could be treated as trainable systems optimized by gradient methods against a global objective. That was a major conceptual shift: instead of tuning modules separately and hoping their errors composed well, the system could be trained to improve the final task-level result. The cheque-reading system described in the paper gave this argument unusual force because it was not only experimental; it was deployed commercially and read several million cheques per day.
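The idea of training composed modules against one global objective can be sketched with a deliberately tiny example. This is not a graph transformer network: the two scalar "modules", the squared-error loss, the learning rate, and every name below are assumptions chosen only to show the mechanism. The gradient of the final loss is passed back through the later module to the earlier one, so both are adjusted to improve the end result rather than a separate intermediate target.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # module 1: stand-in for an early pipeline stage
    y = w2 * h              # module 2: stand-in for a later pipeline stage
    return h, y

def global_loss_and_grads(w1, w2, x, target):
    """Loss on the final output, with gradients for both modules."""
    h, y = forward(w1, w2, x)
    loss = 0.5 * (y - target) ** 2
    dy = y - target                     # dL/dy
    grad_w2 = dy * h                    # dL/dw2
    dh = dy * w2                        # dL/dh, passed back to module 1
    grad_w1 = dh * (1.0 - h * h) * x    # dL/dw1, using tanh' = 1 - tanh^2
    return loss, grad_w1, grad_w2

# Hypothetical starting parameters and a single training example.
w1, w2 = 0.5, -0.3
x, target = 1.5, 0.8
learning_rate = 0.1

initial_loss, _, _ = global_loss_and_grads(w1, w2, x, target)
for _ in range(200):
    _, g1, g2 = global_loss_and_grads(w1, w2, x, target)
    # Both modules are updated from the same task-level loss.
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2
final_loss, _, _ = global_loss_and_grads(w1, w2, x, target)

print(initial_loss, final_loss)
```

In the paper the modules exchange graphs rather than scalars, but the same chain-rule principle is what allows a single performance measure to drive every stage.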

Its later importance comes from how clearly it anticipated the deep-learning revival of the 2010s. The paper did not have today’s data scale, GPUs, or very deep architectures, but it established several principles that became central to later breakthroughs in vision, speech, and sequence modeling: learned features, convolutional inductive bias, backpropagation through composed systems, and optimization of end-to-end performance rather than intermediate hand-designed objectives. In retrospect, it stands as one of the clearest pre-ImageNet demonstrations that neural networks could be engineered into reliable industrial systems and that gradient-based learning was a general paradigm for perception, not a specialized trick for toy recognition tasks.

Abstract

Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.