Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation¶
Why this mattered¶
TBD
Abstract¶
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
Related¶
- cite → The Pascal Visual Object Classes (VOC) Challenge — R-CNN evaluates object detection and segmentation performance on the PASCAL VOC benchmark introduced by the VOC Challenge.
- cite → Selective Search for Object Recognition — R-CNN uses Selective Search to generate category-independent region proposals before CNN feature extraction.
- cite → ImageNet: A large-scale hierarchical image database — R-CNN relies on ImageNet pretraining to initialize convolutional networks for object detection.
- cite → Gradient-based learning applied to document recognition — R-CNN builds on the convolutional neural network architecture popularized by LeNet for visual recognition tasks.
- cite → Backpropagation Applied to Handwritten Zip Code Recognition — R-CNN cites early backpropagation-trained convolutional networks as the historical basis for learned visual feature hierarchies.
- cite → Distinctive Image Features from Scale-Invariant Keypoints — R-CNN contrasts learned CNN features with SIFT-style hand-engineered local image descriptors.
- cite → Histograms of Oriented Gradients for Human Detection — R-CNN contrasts its learned region features with HOG descriptors used in earlier object detection pipelines.
- cite → ImageNet classification with deep convolutional neural networks — R-CNN adopts the AlexNet-style deep convolutional network breakthrough from ImageNet classification for detection feature extraction.
- enables → Segment Anything — R-CNN linked CNN feature extraction with region-level recognition, enabling SAM's use of learned visual features for object-level segmentation masks.
- enables → Momentum Contrast for Unsupervised Visual Representation Learning — R-CNN demonstrated that deep convolutional features transfer effectively to detection, motivating MoCo to learn transferable visual representations without labels.
- cite ← The Cityscapes Dataset for Semantic Urban Scene Understanding — Cityscapes cites R-CNN for using convolutional feature hierarchies in object detection and semantic segmentation.
- cite ← Going deeper with convolutions — GoogLeNet builds on R-CNN's demonstration that convolutional feature hierarchies improve object detection and segmentation.
- cite ← Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization — Grad-CAM uses CNN object-recognition architectures exemplified by R-CNN as targets for class-discriminative visual explanations.
- cite ← Learning Deep Features for Discriminative Localization — CAM contrasts weakly supervised localization with R-CNN-style object detection that depends on region proposals and bounding-box supervision.
- cite ← Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification — The rectifier-network paper uses R-CNN as the detection framework in which its ImageNet-trained features are transferred to object detection.
- cite ← Deep Residual Learning for Image Recognition — ResNet cites R-CNN as a prior deep feature hierarchy for object detection and semantic segmentation.
- cite ← Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — Faster R-CNN builds directly on R-CNN's idea of using CNN feature hierarchies for region-based object detection.
- cite ← Segment Anything — Segment Anything connects to R-CNN through the shared framing of segmentation as a vision task requiring object-level masks from learned image features.
- cite ← ImageNet Large Scale Visual Recognition Challenge — ILSVRC cites R-CNN as a leading detection approach combining Selective Search proposals with convolutional neural-network features.
- cite ← Momentum Contrast for Unsupervised Visual Representation Learning — MoCo uses the R-CNN detection framework as a downstream test of whether unsupervised features transfer to object detection.
- enables ← The Pascal Visual Object Classes (VOC) Challenge — PASCAL VOC provided the object-detection benchmark and evaluation protocol on which R-CNN demonstrated region-based CNN detection gains.
- enables ← ImageNet: A large-scale hierarchical image database — ImageNet pretraining supplied the large-scale labeled visual representations that R-CNN fine-tuned for object detection and segmentation.
- enables ← Gradient-based learning applied to document recognition — LeCun et al. showed that convolutional networks can learn hierarchical visual features end-to-end, which R-CNN reused through a deep CNN feature extractor for object proposals.
- enables ← Backpropagation Applied to Handwritten Zip Code Recognition — The zip-code recognition paper established backpropagation-trained convolutional filters for image recognition, a core mechanism behind the CNN features used in R-CNN.
- enables ← Distinctive Image Features from Scale-Invariant Keypoints — SIFT popularized scale-invariant local image descriptors, providing a hand-crafted feature baseline that R-CNN surpassed with learned convolutional region features.
- enables ← Histograms of Oriented Gradients for Human Detection — HOG demonstrated that oriented-gradient descriptors are effective for object detection, setting the feature-engineering context that R-CNN replaced with learned CNN representations.