Learning Deep Features for Discriminative Localization

Why this mattered

Before this paper, CNNs were usually treated as powerful classifiers whose spatial reasoning had to be supervised separately, typically with bounding boxes, segmentation masks, or specialized detection architectures. Zhou et al. showed that a classifier trained only with image-level labels could still reveal where it was “looking” by replacing fully connected layers with global average pooling and projecting class weights back onto the final convolutional feature maps. The resulting class activation maps made weakly supervised localization practical and simple: the same network trained for recognition could identify discriminative image regions without explicit localization annotations.
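The projection described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names, array shapes, and random toy inputs are all hypothetical, and the classifier is assumed to be a single bias-free linear layer sitting on top of global average pooling.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights, class_idx):
    """Weighted sum of the final conv feature maps for one class.

    feature_maps:  (K, H, W) activations of the last conv layer (assumed shape)
    class_weights: (C, K) weights of the linear layer after pooling (assumed shape)
    """
    w = class_weights[class_idx]                            # (K,)
    return np.tensordot(w, feature_maps, axes=([0], [0]))   # (H, W)

def gap_logits(feature_maps, class_weights):
    """Class scores from global average pooling plus a bias-free linear layer."""
    pooled = feature_maps.mean(axis=(1, 2))                 # (K,)
    return class_weights @ pooled                           # (C,)

# Toy inputs standing in for a real network's activations and weights.
rng = np.random.default_rng(0)
features = rng.random((4, 7, 7))        # K=4 channels on a 7x7 grid
weights = rng.standard_normal((3, 4))   # C=3 classes

cam = class_activation_map(features, weights, class_idx=1)
logits = gap_logits(features, weights)
```

Under these assumptions the spatial mean of the map equals the class score, which is why the map can be read as a per-location breakdown of the evidence for that class. In practice the map would be upsampled to the input resolution before being overlaid on the image.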

The shift was that interpretability became operational. Class activation maps were not merely post hoc visualizations; they exposed a reusable mechanism for turning classification features into spatial evidence. That made it possible to audit CNN decisions, build weakly supervised object localization systems, and study the relationship between recognition and attention using only classification-trained models. The paper’s reported ILSVRC localization result was important less as a final detector than as evidence that high-level convolutional features preserved enough spatial structure to support localization without bounding boxes.
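One way to turn an activation map into a box prediction is to threshold it and take the extent of the surviving region. The sketch below is a simplified illustration, not the paper's exact procedure: the function name `cam_to_box`, the default threshold fraction, and the choice to box every above-threshold pixel (rather than isolating a single connected region) are all assumptions made for brevity.

```python
import numpy as np

def cam_to_box(cam, threshold_fraction=0.2):
    """Return (x_min, y_min, x_max, y_max) around the strongly activated region.

    threshold_fraction is an illustrative assumption: pixels are kept when
    they reach that fraction of the map's maximum after shifting the map so
    its minimum is zero.
    """
    shifted = cam - cam.min()
    if shifted.max() == 0:
        # Flat map: no discriminative region, so fall back to the full extent.
        height, width = cam.shape
        return 0, 0, width - 1, height - 1
    mask = shifted >= threshold_fraction * shifted.max()
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])

# Toy map with one bright rectangle covering rows 2-3 and columns 1-4.
toy_cam = np.zeros((6, 6))
toy_cam[2:4, 1:5] = 1.0
box = cam_to_box(toy_cam)   # (1, 2, 4, 3)
```

Coordinates here are in the map's own grid; a real pipeline would rescale them to the input image, and would typically restrict the box to one connected region so that scattered activations do not inflate it.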

This idea became a foundation for later work on visual explanations and weak supervision. Grad-CAM and related methods generalized the same intuition to broader architectures without requiring global average pooling, while weakly supervised detection, segmentation, medical-imaging triage, and failure-analysis workflows adopted activation-map-style evidence as a standard diagnostic tool. In retrospect, the paper helped shift deep learning practice from asking only whether a model predicted the right label to asking which image evidence supported that prediction, a question that became central to trustworthy computer vision.

Abstract

In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotation. We demonstrate in a variety of experiments that our network is able to localize the discriminative image regions despite just being trained for solving classification task.

Sources