Learning Deep Features for Discriminative Localization
Why this mattered
Before this paper, CNNs were usually treated as powerful classifiers whose spatial reasoning had to be supervised separately, typically with bounding boxes, segmentation masks, or specialized detection architectures. Zhou et al. showed that a classifier trained only with image-level labels could still reveal where it was “looking”: they replaced the fully connected layers with global average pooling followed by a single linear classification layer, then projected that layer's class weights back onto the final convolutional feature maps. The resulting class activation maps made weakly supervised localization simple and practical, because the same network trained for recognition could identify discriminative image regions without explicit localization annotations.
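The projection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the array shapes, and the random stand-in features and weights are all assumptions made for the example. It also checks one useful property, namely that the spatial mean of the map equals the class score produced by global average pooling followed by the linear layer (bias omitted).

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the final conv feature maps for one class.

    feature_maps: array of shape (K, H, W), K channels (hypothetical layout).
    class_weights: array of shape (K,), the linear-layer weights for one class.
    Returns an (H, W) map.
    """
    return np.tensordot(class_weights, feature_maps, axes=1)

def class_score(feature_maps, class_weights):
    """Class score from global average pooling plus a linear layer (no bias)."""
    pooled = feature_maps.mean(axis=(1, 2))  # global average pooling -> (K,)
    return float(class_weights @ pooled)

# Random stand-ins for real network activations and learned weights.
rng = np.random.default_rng(0)
feature_maps = rng.random((4, 7, 7))
class_weights = rng.normal(size=4)

cam = class_activation_map(feature_maps, class_weights)
score = class_score(feature_maps, class_weights)
```

In practice the map is typically upsampled to the input image size before being overlaid on the image; that step is left out here to keep the sketch dependency-free.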
The paradigm shift was that interpretability became operational. Class activation maps were not merely post hoc visualizations; they exposed a reusable mechanism for turning classification features into spatial evidence. That made it possible to audit CNN decisions, build weakly supervised object localization systems, and study the relationship between recognition and attention using only classification-trained models. The paper’s reported result of 37.1% top-5 localization error on ILSVRC 2014 mattered less as a final detector than as evidence that high-level convolutional features preserve enough spatial structure to support localization without bounding boxes.
This idea became a foundation for later work on visual explanations and weak supervision. Grad-CAM and related methods generalized the same intuition to broader architectures without requiring global average pooling, while weakly supervised detection, segmentation, medical-imaging triage, and failure-analysis workflows adopted activation-map-style evidence as a standard diagnostic tool. In retrospect, the paper helped shift deep learning practice from asking only whether a model predicted the right label to asking which image evidence supported that prediction, a question that became central to trustworthy computer vision.
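The relationship between the two methods can be made concrete with a small NumPy sketch. It assumes the standard Grad-CAM formulation (channel weights are the spatial mean of the class-score gradients, followed by a ReLU) and supplies the gradient analytically for a network that ends in global average pooling plus a linear layer; a real implementation would obtain the gradients from an autograd framework. All names and the random data are hypothetical.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM-style map from activations and class-score gradients.

    feature_maps, gradients: arrays of shape (K, H, W).
    """
    alphas = gradients.mean(axis=(1, 2))  # one weight per channel
    weighted = np.tensordot(alphas, feature_maps, axes=1)
    return np.maximum(weighted, 0.0)  # ReLU keeps positive evidence only

rng = np.random.default_rng(0)
K, H, W = 4, 7, 7
feature_maps = rng.random((K, H, W))
class_weights = rng.normal(size=K)

# With score = sum_k w_k * mean(A_k), the gradient of the score with respect
# to every spatial position of channel k is w_k / (H * W).
gradients = np.broadcast_to(class_weights[:, None, None] / (H * W), (K, H, W))

cam = np.tensordot(class_weights, feature_maps, axes=1)
grad_cam_map = grad_cam(feature_maps, gradients)
```

Under these assumptions the Grad-CAM map equals the rectified class activation map up to the constant factor 1 / (H * W), which is why gradient-derived weights can stand in for the architecture-specific ones.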
Abstract
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotation. We demonstrate in a variety of experiments that our network is able to localize the discriminative image regions despite just being trained for solving classification task.
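One simple way to turn an activation map into a localization box is to threshold it and take the extent of the surviving pixels. The sketch below is an illustrative simplification, not the procedure described in the text above: the function name, the 0.2 threshold fraction, and the choice to box all above-threshold pixels (rather than, for example, only the largest connected region) are assumptions made for the example.

```python
import numpy as np

def box_from_map(cam, frac=0.2):
    """Bounding box (x_min, y_min, x_max, y_max) of pixels >= frac * max.

    frac is a hypothetical threshold parameter. Returns None when the map
    has no positive peak to threshold against.
    """
    peak = cam.max()
    if peak <= 0:
        return None
    mask = cam >= frac * peak
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing any hit
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing any hit
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])

# Toy map with a single bright 3x3 region.
cam = np.zeros((6, 8))
cam[2:5, 3:6] = 1.0
box = box_from_map(cam)  # (3, 2, 5, 4)
```

The coordinates are in the map's own grid, so they would need to be rescaled to image pixels if the map is smaller than the input.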
Related
- cite → Going deeper with convolutions — Class activation mapping is demonstrated on GoogLeNet-style convolutional networks with global average pooling.
- cite → Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — CAM contrasts weakly supervised localization with R-CNN-style object detection that depends on region proposals and bounding-box supervision.
- cite → ImageNet Large Scale Visual Recognition Challenge — CAM evaluates discriminative localization using models trained on the ImageNet large-scale classification and localization benchmark.
- cite → ImageNet classification with deep convolutional neural networks — CAM builds on the finding from AlexNet that deep convolutional networks learn discriminative visual features for ImageNet classification.
- cite ← Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization — Grad-CAM generalizes CAM-style discriminative localization by replacing architecture-specific global-average-pooling weights with gradient-derived weights.