Deep Residual Learning for Image Recognition
Why this mattered
Before ResNet, increasing depth was widely understood to be desirable but practically fragile: very deep convolutional networks often became harder to optimize, and adding layers could increase training error, a degradation that overfitting alone does not explain. He, Zhang, Ren, and Sun reframed the problem by having stacked layers learn a residual mapping relative to their input, implemented through identity shortcut connections. This made depth itself a usable design axis. The paper’s 152-layer ImageNet model, 8× deeper than VGG nets yet lower in computational complexity, showed that very deep networks could be trained effectively and could improve accuracy rather than degrade it.
The paradigm shift was not only a better ImageNet result, though the ensemble’s 3.57% top-5 error on the ImageNet test set and the ILSVRC 2015 wins made the result impossible to ignore. ResNet supplied a general architectural principle: preserve an easy path for information and gradients while letting learned layers model refinements. That principle quickly became part of the default vocabulary of deep learning, appearing in later convolutional systems for detection and segmentation, and influencing architectures beyond vision wherever very deep models needed stable optimization.
After this paper, depth stopped being treated mainly as an optimization hazard and became a scalable resource. ResNet backbones became standard infrastructure for object detection, instance segmentation, medical imaging, remote sensing, and representation learning, while residual connections became central to later breakthroughs including very deep sequence models and transformer architectures. Its lasting importance is that it turned a practical training workaround into a broadly reusable design pattern for building much larger, more capable neural networks.
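The residual idea described above can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions, not the paper's architecture: the two layers are fully connected rather than convolutional, batch normalization is omitted, and the names `residual_block`, `plain_block`, `w1`, and `w2` are hypothetical. It shows only the core point that a block with an identity shortcut passes its input through when its learned layers contribute nothing, whereas the same layers without the shortcut do not.

```python
import numpy as np


def relu(z):
    """Element-wise rectified linear unit."""
    return np.maximum(z, 0.0)


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = relu(x @ w1) @ w2.

    Assumption: dense layers stand in for the paper's convolutions.
    """
    residual = relu(x @ w1) @ w2   # F(x): the learned refinement
    return relu(residual + x)      # identity shortcut, then nonlinearity


def plain_block(x, w1, w2):
    """The same two layers with no shortcut connection."""
    return relu(relu(x @ w1) @ w2)


rng = np.random.default_rng(0)
# Non-negative inputs, as they would be after a preceding ReLU.
x = np.abs(rng.normal(size=(4, 8)))
w_zero = np.zeros((8, 8))

# With all-zero weights, F(x) = 0: the residual block returns its input,
# while the plain block collapses to zeros.
res_out = residual_block(x, w_zero, w_zero)
plain_out = plain_block(x, w_zero, w_zero)
```

With zero weights the residual block's output equals `x`, so stacking more such blocks cannot by itself make the representation worse; the plain block has to learn an identity-like mapping through its weights to achieve the same effect.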
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers - 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
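The reformulation mentioned in the abstract can be written compactly. If $\mathcal{H}(\mathbf{x})$ denotes the mapping a few stacked layers are meant to fit, the layers are instead asked to fit the residual $\mathcal{F}$, and the block adds the input back through the shortcut:

$$
\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}, \qquad \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}
$$

As an added illustration (a direct consequence of the sum above, not a result quoted from the abstract), differentiating the block output with respect to its input gives an identity term alongside the learned term, which is one way to see why the shortcut keeps a direct path open for gradients:

$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = I + \frac{\partial \mathcal{F}}{\partial \mathbf{x}}
$$

This sketch ignores any nonlinearity applied after the addition and assumes the input and output have matching dimensions.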
Related
- cite → Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — ResNet cites Faster R-CNN as the object-detection framework in which residual networks improve detection accuracy.
- cite → Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification — ResNet cites PReLU rectifier initialization work as evidence that activation and initialization choices enable very deep ImageNet classifiers.
- cite → The Pascal Visual Object Classes (VOC) Challenge — ResNet cites the PASCAL VOC benchmark to evaluate residual features on object detection and segmentation transfer tasks.
- cite → Long Short-Term Memory — ResNet cites LSTM as an example of shortcut-like connections that ease optimization in deep sequence models.
- cite → Going deeper with convolutions — ResNet contrasts residual learning with GoogLeNet's Inception modules as another route to deeper convolutional networks.
- cite → Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — ResNet cites R-CNN as a prior deep feature hierarchy for object detection and semantic segmentation.
- cite → ImageNet Large Scale Visual Recognition Challenge — ResNet uses the ImageNet Large Scale Visual Recognition Challenge as its main classification benchmark.
- cite → Backpropagation Applied to Handwritten Zip Code Recognition — ResNet cites LeCun's zip-code CNN work as an early demonstration of backpropagation-trained convolutional networks.
- cite → ImageNet classification with deep convolutional neural networks — ResNet cites AlexNet as the breakthrough ImageNet convolutional network that established deep CNNs for visual recognition.
- enables → Highly accurate protein structure prediction with AlphaFold — Residual networks enabled AlphaFold's very deep neural architectures to propagate pairwise and spatial protein-structure features without optimization collapse.
- enables → Segment Anything — Residual learning enabled very deep vision backbones, supporting SAM's high-capacity image encoder for general-purpose segmentation.
- cite ← Mastering the game of Go without human knowledge — AlphaGo Zero relies on residual neural network architectures introduced by ResNet to train deeper policy and value networks.
- cite ← Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network — SRGAN's generator is built from residual blocks inspired by ResNet's skip-connection architecture.
- cite ← Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization — Grad-CAM demonstrates class-discriminative localization on deep residual networks used for image recognition.
- cite ← Dermatologist-level classification of skin cancer with deep neural networks — The skin-cancer classifier cites residual networks as a major deep convolutional architecture improving image-recognition accuracy.
- cite ← Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks — CycleGAN used residual network blocks inspired by ResNet to build its image translation generators.
- cite ← TensorFlow: a system for large-scale machine learning — TensorFlow cites residual networks as a state-of-the-art deep architecture whose depth benefits from scalable distributed training infrastructure.
- cite ← Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — Faster R-CNN cites residual networks as stronger CNN feature extractors for improving object detection accuracy.
- cite ← Squeeze-and-Excitation Networks — Squeeze-and-Excitation Networks insert channel-wise feature recalibration blocks into residual architectures introduced by ResNet.
- cite ← Highly accurate protein structure prediction with AlphaFold — AlphaFold's neural architecture uses residual connections popularized by ResNet to train deep networks for protein-structure inference.
- cite ← Segment Anything — Segment Anything relies on modern deep visual backbones whose development was strongly shaped by residual networks for scalable image representation learning.
- cite ← Momentum Contrast for Unsupervised Visual Representation Learning — MoCo uses ResNet architectures as the encoder backbone for contrastive representation learning.
- enables ← The Pascal Visual Object Classes (VOC) Challenge — PASCAL VOC helped standardize visual-recognition benchmarking and detection tasks that residual networks later improved through very deep convolutional features.
- enables ← Long Short-Term Memory — LSTM's gated additive state path anticipated the identity-like information flow that ResNets used to ease optimization of deep networks.
- enables ← Backpropagation Applied to Handwritten Zip Code Recognition — Backpropagation for convolutional networks supplied the gradient-based training foundation used to optimize deep residual networks.