
The Cityscapes Dataset for Semantic Urban Scene Understanding

Why this mattered

Cityscapes mattered because it turned urban scene understanding from a loosely comparable research problem into a benchmarked, high-resolution, street-level perception task with complexity closer to what real autonomous-driving systems face. Earlier datasets had supported semantic segmentation, but Cityscapes combined dense pixel-level labels, instance annotations, stereo video context, and scenes from 50 cities. That combination made it possible to evaluate algorithms on fine-grained urban classes such as road, sidewalk, person, rider, car, traffic sign, and vegetation under realistic variation. The paper’s central contribution was not a new model but a new experimental substrate: it made dense urban perception measurable at a scale and annotation quality that matched the ambitions of deep learning.

After Cityscapes, semantic segmentation for driving could be trained and compared against a shared target rather than demonstrated only on small or less varied datasets. This directly supported the rise of fully convolutional, encoder-decoder, dilated-convolution, pyramid-pooling, and later transformer-based segmentation systems, many of which reported Cityscapes results as a standard proof point. Its 20 000 coarsely annotated images, released alongside 5000 finely annotated ones, also made weakly supervised and semi-supervised learning more concrete: researchers could ask how much dense manual labeling was truly necessary when larger quantities of cheaper labels were available.

The dataset also helped define what “scene understanding” meant for autonomous driving: not just detecting objects, but assigning a semantic class to every pixel and distinguishing individual object instances where relevant. That framing influenced later benchmarks and datasets for driving perception, including larger and more multimodal efforts, but Cityscapes remained a reference because it captured a clean, rigorous version of the task. Its lasting importance is that it made progress in urban semantic perception cumulative: methods could be compared, failure modes could be studied, and improvements could be tied to a demanding public benchmark rather than isolated demonstrations.
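To make "methods could be compared" concrete: the benchmark's pixel-level task ranks methods by per-class intersection-over-union, IoU = TP / (TP + FP + FN), averaged over classes. The following is a minimal sketch of that computation on toy label maps, not the official evaluation code. The function names, the class ids, and the ignore value of 255 are assumptions chosen for illustration.

```python
from collections import Counter


def per_class_iou(pred, truth, ignore_label=255):
    """Return {class_id: IoU} for two equally sized 2-D label maps.

    ignore_label marks unlabeled pixels; 255 is an assumed convention here.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if t == ignore_label:
                continue  # unlabeled pixels count neither for nor against
            if p == t:
                tp[t] += 1
            else:
                fp[p] += 1  # class p was predicted where it is absent
                fn[t] += 1  # class t was present but missed
    classes = set(tp) | set(fp) | set(fn)
    # Every class in the union has at least one count, so the denominator is > 0.
    return {c: tp[c] / (tp[c] + fp[c] + fn[c]) for c in sorted(classes)}


def mean_iou(ious):
    """Average the per-class scores into a single ranking number."""
    return sum(ious.values()) / len(ious)


# Toy 2x4 label maps with hypothetical ids: 0 = road, 1 = sidewalk, 2 = person.
truth = [[0, 0, 1, 1],
         [0, 0, 2, 255]]
pred = [[0, 0, 1, 0],
        [0, 1, 2, 2]]

ious = per_class_iou(pred, truth)
print(ious, mean_iou(ious))
```

A real evaluation would accumulate the TP, FP, and FN counts over the whole test set before dividing, rather than scoring one image at a time. Averaging over classes instead of pixels keeps rare but safety-relevant classes such as person from being swamped by road and building pixels.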

Abstract

Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20 000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
