Segment Anything
Why this mattered
TBD
Abstract
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at segment-anything.com to foster research into foundation models for computer vision. The full paper is available at arxiv.org/abs/2304.02643.
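To make "promptable" concrete, here is a minimal sketch of segmenting the object under a single foreground click. It assumes the released `segment_anything` Python package and a locally downloaded ViT-H checkpoint, neither of which is described in the abstract itself. The function names `best_mask_index` and `segment_with_point` and the `checkpoint_path` argument are illustrative, not part of the release.

```python
from typing import Sequence, Tuple


def best_mask_index(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate mask."""
    if len(scores) == 0:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=lambda i: scores[i])


def segment_with_point(image_rgb, point_xy: Tuple[int, int], checkpoint_path: str):
    """Segment the object under one foreground click (illustrative helper).

    image_rgb: HxWx3 uint8 array in RGB order.
    point_xy: (x, y) pixel coordinates of the click.
    checkpoint_path: hypothetical local path to a downloaded ViT-H checkpoint.
    """
    # Imported lazily so best_mask_index stays usable without the model installed.
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint_path)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # computes the image embedding once

    masks, scores, _ = predictor.predict(
        point_coords=np.array([point_xy], dtype=np.float32),
        point_labels=np.array([1]),  # 1 = foreground point, 0 = background
        multimask_output=True,  # several candidates for an ambiguous prompt
    )
    # Keep the candidate the model itself rates highest.
    return masks[best_mask_index(scores.tolist())]
```

A single click is often ambiguous (a part versus the whole object), which is why this sketch asks for several candidate masks and keeps the one with the highest predicted score. Other selection rules would be equally reasonable.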
Related
- cite → Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — Segment Anything builds on the region-proposal object detection lineage exemplified by Faster R-CNN's RPN-based instance localization.
- cite → Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — Segment Anything connects to R-CNN through the shared framing of segmentation as a vision task requiring object-level masks from learned image features.
- cite → Snakes: Active contour models — Segment Anything cites active contours as an early interactive segmentation approach where user-guided boundaries define object masks.
- cite → A Computational Approach to Edge Detection — Segment Anything relates to Canny edge detection through the longstanding use of image boundaries as cues for object segmentation.
- cite → Deep Residual Learning for Image Recognition — Segment Anything relies on modern deep visual backbones whose development was strongly shaped by residual networks for scalable image representation learning.
- cite → The Cityscapes Dataset for Semantic Urban Scene Understanding — Segment Anything contrasts its broad mask dataset and promptable segmentation task with domain-specific semantic segmentation benchmarks such as Cityscapes.
- enables ← Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — Faster R-CNN's region proposal framing enabled SAM's promptable segmentation pipeline by establishing object localization as a reusable precursor to mask prediction.
- enables ← Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation — R-CNN linked CNN feature extraction with region-level recognition, enabling SAM's use of learned visual features for object-level segmentation masks.
- enables ← Snakes: Active contour models — Active contours introduced interactive boundary refinement, enabling SAM's prompt-driven mask generation around user-specified objects.
- enables ← A Computational Approach to Edge Detection — Canny edge detection formalized image boundaries as computable cues, enabling SAM's mask decoder to rely on boundary-sensitive visual representations.
- enables ← Deep Residual Learning for Image Recognition — Residual learning enabled very deep vision backbones, supporting SAM's high-capacity image encoder for general-purpose segmentation.
- enables ← The Cityscapes Dataset for Semantic Urban Scene Understanding — Cityscapes popularized large-scale dense segmentation benchmarks, motivating the broad mask-quality evaluation culture used by SAM.