
Deep Residual Learning for Image Recognition

Why this mattered

Before ResNet, increasing depth was widely understood to be desirable but practically fragile: very deep convolutional networks often became harder to optimize, and adding layers could increase training error, a degradation that overfitting alone does not explain. He, Zhang, Ren, and Sun reframed the problem by having stacked layers learn a residual mapping relative to their input, implemented through identity shortcut connections. This made depth itself a usable design axis. The paper’s 152-layer ImageNet model, substantially deeper than VGG-style networks yet lower in computational complexity, showed that very deep networks could be trained effectively and could improve accuracy rather than degrade it.
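The residual reformulation can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes dense layers in place of convolutions, omits batch normalization, and uses hypothetical names (`residual_block`, `w1`, `w2`). It shows only the core idea that the block computes a learned branch F(x) and adds the input back through an identity shortcut.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F is two linear layers with a ReLU between.

    Assumption for illustration: dense layers stand in for the
    convolutions used in the paper, and batch normalization is omitted.
    """
    f = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(f + x)      # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(4, 8)))   # non-negative inputs, as after a ReLU
zero = np.zeros((8, 8))

# With all-zero weights the residual branch contributes nothing,
# so the block reduces to the identity on non-negative inputs.
out = residual_block(x, zero, zero)
```

With zero weights the block passes its input through unchanged, which suggests why an added block need not make the network worse: the identity mapping is available by driving the residual branch toward zero.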

The paradigm shift was not only a better ImageNet result, though the 3.57% top-5 error ensemble and ILSVRC 2015 wins made the result impossible to ignore. ResNet supplied a general architectural principle: preserve an easy path for information and gradients while letting learned layers model refinements. That principle quickly became part of the default vocabulary of deep learning, appearing in later convolutional systems for detection and segmentation, and influencing architectures beyond vision wherever very deep models needed stable optimization.
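The "easy path for gradients" can be made concrete with a short derivation. As a simplifying assumption, ignore the nonlinearity applied after the addition, so that a block computes y = x + F(x); the loss symbol is introduced here for illustration only.

$$
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
= \frac{\partial \mathcal{L}}{\partial y} + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
$$

The first term carries the gradient from the block's output to its input unchanged, regardless of how small the Jacobian of the learned branch becomes.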

After this paper, depth stopped being treated mainly as an optimization hazard and became a scalable resource. ResNet backbones became standard infrastructure for object detection, instance segmentation, medical imaging, remote sensing, and representation learning, while residual connections became central to later breakthroughs including very deep sequence models and transformer architectures. Its lasting importance is that it turned a practical training workaround into a broadly reusable design pattern for building much larger, more capable neural networks.

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers - 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Sources