Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Why this mattered

Faster R-CNN mattered because it removed a central architectural compromise in early deep object detection: the detector was learned, but the candidate object regions were still typically produced by a separate, hand-engineered proposal method such as Selective Search. Fast R-CNN had made classification and bounding-box refinement much faster, but proposal generation remained an external bottleneck. Ren, He, Girshick, and Sun’s key move was to make region proposal itself a neural-network task, sharing convolutional features with the detector. This turned “where to look” from a preprocessing step into a trainable component of the detection system.

That shift made high-accuracy object detection more unified, faster, and easier to improve end-to-end. The paper showed that a Region Proposal Network could produce a small number of high-quality proposals, about 300 per image, while preserving or improving accuracy on PASCAL VOC and MS COCO. With VGG-16, the full system ran at roughly 5 frames per second on a GPU. That fell short of real-time video rates, but it changed the practical regime: accurate detection could now be built as a single shared-feature model rather than a pipeline stitched together from unrelated algorithms.

Its influence extended beyond the particular two-stage detector it introduced. Faster R-CNN became the template for a major branch of modern detection systems: learned proposal generation, shared backbones, anchor boxes, objectness prediction, and joint localization-classification training. Later systems such as Mask R-CNN built directly on its region-based framework, while one-stage detectors such as SSD, YOLO variants, RetinaNet, and later anchor-free methods can also be understood partly as responses to the same question it sharpened: how much of object detection can be made dense, learned, and integrated into the network itself? In that sense, the paper helped move object detection from feature engineering plus neural classification toward the fully learned visual recognition systems that became standard in computer vision.

Abstract

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features; using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
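
To make "predicts object bounds and objectness scores at each position" concrete, here is a minimal pure-Python sketch of the anchor-box idea that region proposal networks of this kind are usually described with. It is an illustration under stated assumptions, not the authors' code: the function names `make_anchors` and `rpn_output_shapes` are hypothetical, and the stride of 16 pixels, the three scales, and the three aspect ratios are assumed defaults that do not appear in the text above. For simplicity the sketch also assumes one objectness score per anchor.

```python
import math
from itertools import product


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) reference boxes in image coordinates.

    One anchor is created per (feature-map position, scale, ratio).
    `ratio` is height / width. All defaults are illustrative assumptions.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Center of this feature-map cell, projected back onto the image.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # Vary the aspect ratio while keeping the area at scale ** 2.
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def rpn_output_shapes(feat_h, feat_w, anchors_per_pos):
    """Shapes of the two per-position outputs in this simplified sketch."""
    return {
        # One objectness score per anchor (a simplifying assumption).
        "objectness": (feat_h, feat_w, anchors_per_pos),
        # Four box-refinement offsets per anchor.
        "box_deltas": (feat_h, feat_w, anchors_per_pos * 4),
    }


if __name__ == "__main__":
    anchors = make_anchors(feat_h=2, feat_w=3)
    print(len(anchors))  # 2 * 3 positions, 9 anchors each
    print(rpn_output_shapes(2, 3, anchors_per_pos=9))
```

Because every position reuses the same small set of reference boxes, the proposal step reduces to a few extra convolutional outputs on top of the shared feature map, which is why the abstract can describe proposals as nearly cost-free. Ranking the scored boxes and keeping the top few hundred would follow as a separate step that this sketch does not cover.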

Sources