Momentum Contrast for Unsupervised Visual Representation Learning¶
Why this mattered¶
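MoCo gave evidence that unsupervised pre-training can stand in for supervised ImageNet pre-training as the starting point for downstream vision tasks: according to the abstract, it outperforms its supervised counterpart on 7 detection and segmentation tasks across PASCAL VOC, COCO, and other datasets. Its queue and moving-averaged encoder make a large, consistent dictionary of keys available on the fly, which the paper credits for making contrastive unsupervised learning effective.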
Abstract¶
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
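The two mechanisms named in the abstract, a queue used as the dictionary and a moving-averaged key encoder, can be illustrated with a minimal sketch. This is not the authors' implementation: plain Python lists stand in for tensors and encoder weights, the function names are hypothetical, and the queue size, momentum value, and temperature used here are illustrative assumptions.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Moving average of encoder weights: k <- m * k + (1 - m) * q.

    The value of m is an illustrative assumption.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(q, k_pos, queue, temperature=0.07):
    """Softmax cross-entropy over one positive key and the queued keys.

    The positive logit sits at index 0; every key in the queue is treated
    as a negative. The temperature is an illustrative assumption.
    """
    logits = [dot(q, k_pos) / temperature]
    logits += [dot(q, k) / temperature for k in queue]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]


# The dictionary is a fixed-size FIFO queue: appending to a full deque
# silently drops the oldest key. A size of 4 is used only for illustration.
queue = deque(maxlen=4)
for key in ([1, 0], [0, 1], [-1, 0], [0, -1], [0.6, 0.8]):
    queue.append(l2_normalize(key))

q = l2_normalize([1.0, 0.0])
loss_aligned = contrastive_loss(q, l2_normalize([1.0, 0.0]), queue)
loss_orthogonal = contrastive_loss(q, l2_normalize([0.0, 1.0]), queue)
print(loss_aligned < loss_orthogonal)
```

In this sketch the loss is smaller when the query matches its positive key than when it does not, and the key-side weights change only slowly per step, which is one way to read the abstract's description of the dictionary as "consistent".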
Related¶
- cite → Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — MoCo evaluates its unsupervised visual representations by transferring them to Faster R-CNN for object detection.
- cite → The Pascal Visual Object Classes (VOC) Challenge — MoCo evaluates learned visual representations by transfer to object detection on the PASCAL VOC benchmark.
- cite → Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — MoCo cites R-CNN as an example of the pre-train-then-fine-tune practice for object detection that its unsupervised features are meant to serve.
- cite → ImageNet: A large-scale hierarchical image database — MoCo pretrains and evaluates visual representations on ImageNet-scale image classification data.
- cite → Backpropagation Applied to Handwritten Zip Code Recognition — MoCo cites early convolutional backpropagation work as part of the lineage of learned visual features.
- cite → Deep Residual Learning for Image Recognition — MoCo uses ResNet architectures as the encoder backbone for contrastive representation learning.
- cite → The Cityscapes Dataset for Semantic Urban Scene Understanding — MoCo evaluates transfer of learned features to semantic segmentation on the Cityscapes urban-scene benchmark.
- enables ← The Pascal Visual Object Classes (VOC) Challenge — PASCAL VOC provided standard object-recognition benchmarks and evaluation practice that helped define downstream transfer tests for MoCo visual representations.
- enables ← Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — R-CNN demonstrated that deep convolutional features transfer effectively to detection, motivating MoCo to learn transferable visual representations without labels.
- enables ← ImageNet: A large-scale hierarchical image database — ImageNet supplied the large-scale visual dataset and supervised-pretraining benchmark that MoCo used as the central comparison point for self-supervised learning.
- enables ← Backpropagation Applied to Handwritten Zip Code Recognition — Backpropagation for convolutional handwritten-recognition networks established end-to-end gradient training of visual feature extractors that modern contrastive encoders rely on.