Cityscapes Dataset: Urban Scene Understanding at Its Best

 

๐Ÿ™️ Cityscapes Dataset: Urban Scene Understanding at Its Best

The Cityscapes dataset is a large-scale, richly annotated dataset focused on semantic understanding of urban street scenes. It’s widely used in computer vision for tasks like semantic segmentation, instance segmentation, depth estimation, and scene parsing—particularly in autonomous driving and smart city applications.


What is Cityscapes?

Cityscapes contains high-resolution images of street scenes recorded in 50 cities, mostly in Germany with a few in neighboring countries, over several months (spring, summer, and fall). The recordings were made in daytime under good to medium weather conditions, so night and adverse-weather scenes are not covered. The focus is on pixel-level semantic annotation, especially for objects relevant to urban mobility such as roads, pedestrians, cars, traffic signs, and sidewalks.


Key Statistics

| Feature | Description |
| --- | --- |
| Number of Images | 5,000 finely annotated + 20,000 coarsely annotated |
| Resolution | 2048×1024 pixels |
| Cities Covered | 50 cities, mostly in Germany |
| Classes | 30 (19 commonly used for training/benchmarking) |
| Annotations | Fine + coarse annotations, with instance-level masks |
| Formats Available | JSON polygons + PNG masks |

Annotation Types

Cityscapes supports multiple types of annotations:

  1. Semantic Segmentation – Per-pixel labeling of 19 urban object classes.

  2. Instance Segmentation – Differentiates between multiple instances of the same object class.

  3. Panoptic Segmentation – Combines semantic and instance segmentation.

  4. Depth Maps – Stereo image pairs provide disparity for depth estimation.

  5. Bounding Boxes – Not stored in the core fine annotations, but easily derived from the instance masks for object detection tasks.

  6. Video Sequences – Available for temporal analysis (e.g., tracking, segmentation over time).
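
To make the instance annotations a bit more concrete, here is a minimal sketch of decoding an instance mask and deriving bounding boxes from it. It assumes the encoding used by the cityscapesScripts tooling as I understand it: in the instanceIds PNG, pixels of classes with instances store labelId * 1000 + instance index, and all other pixels store the plain labelId. The function names are hypothetical and the array is synthetic, so check the encoding against the official scripts before relying on it.

```python
import numpy as np

def split_instance_ids(instance_ids):
    """Split an instanceIds array into (label_ids, instance_index).

    Assumption: values >= 1000 encode labelId * 1000 + instance index;
    smaller values are plain labelIds without an instance (index -1).
    """
    instance_ids = np.asarray(instance_ids, dtype=np.int64)
    has_instance = instance_ids >= 1000
    label_ids = np.where(has_instance, instance_ids // 1000, instance_ids)
    instance_index = np.where(has_instance, instance_ids % 1000, -1)
    return label_ids, instance_index

def instance_boxes(instance_ids):
    """Return {instance value: (xmin, ymin, xmax, ymax)} in pixel coordinates."""
    instance_ids = np.asarray(instance_ids, dtype=np.int64)
    boxes = {}
    for value in np.unique(instance_ids):
        if value < 1000:  # no individual instance encoded for this value
            continue
        ys, xs = np.nonzero(instance_ids == value)
        boxes[int(value)] = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return boxes

# Synthetic example: two cars (labelId 26) on a road (labelId 7).
demo = np.array([
    [7,     7,     7, 7],
    [26000, 26000, 7, 26001],
    [26000, 26000, 7, 26001],
])
labels, indices = split_instance_ids(demo)
boxes = instance_boxes(demo)
print(boxes)
```

Pairing the decoded label_ids with instance_index gives the per-pixel (class, instance) view that panoptic segmentation works with.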


19 Key Semantic Classes

The most commonly used subset of classes (for benchmarking) includes:

  • Flat: road, sidewalk

  • Human: person, rider

  • Vehicle: car, truck, bus, train, motorcycle, bicycle

  • Construction: building, wall, fence

  • Object: pole, traffic light, traffic sign

  • Nature: vegetation, terrain

  • Sky: sky

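Each class has a fixed color in the color-coded ground truth masks (the files ending in _gtFine_color.png), which makes the annotations easy to inspect visually. The labelIds and instanceIds masks store numeric IDs instead of colors.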
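
The color coding can be reproduced for your own predictions with a small lookup table. The sketch below assumes class indices 0 to 18 in the order commonly used for the 19 evaluation classes (road first, bicycle last) and the RGB values as they are commonly reproduced from the cityscapesScripts label table; both are worth verifying against the official scripts. The colorize helper is a hypothetical name.

```python
import numpy as np

# Assumed order and colors of the 19 evaluation classes (index = train ID).
CLASS_NAMES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle",
]
PALETTE = np.array([
    (128, 64, 128), (244, 35, 232), (70, 70, 70), (102, 102, 156),
    (190, 153, 153), (153, 153, 153), (250, 170, 30), (220, 220, 0),
    (107, 142, 35), (152, 251, 152), (70, 130, 180), (220, 20, 60),
    (255, 0, 0), (0, 0, 142), (0, 0, 70), (0, 60, 100),
    (0, 80, 100), (0, 0, 230), (119, 11, 32),
], dtype=np.uint8)

def colorize(train_ids, ignore_color=(0, 0, 0)):
    """Map an (H, W) array of class indices to an (H, W, 3) RGB image."""
    train_ids = np.asarray(train_ids).astype(np.int64)
    color = np.empty(train_ids.shape + (3,), dtype=np.uint8)
    color[...] = ignore_color  # anything outside 0..18 gets the ignore color
    valid = (train_ids >= 0) & (train_ids < len(PALETTE))
    color[valid] = PALETTE[train_ids[valid]]
    return color

# Road, car, and an ignored pixel (255).
demo_rgb = colorize(np.array([[0, 13, 255]]))
print(demo_rgb)
```

The result can be passed straight to plt.imshow or saved with PIL's Image.fromarray.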


Common Tasks & Applications

| Task | Purpose |
| --- | --- |
| Semantic Segmentation | Label each pixel with an object class |
| Instance Segmentation | Identify and separate multiple instances of objects |
| Depth Estimation | Reconstruct 3D scene geometry from stereo images |
| Panoptic Segmentation | Combine semantic labels with per-object instance masks |
| Autonomous Driving | Real-time scene understanding for navigation |
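
For the depth estimation row, the usual stereo relation is depth = baseline × focal length / disparity. The sketch below applies it to raw disparity values and assumes the encoding I recall from the dataset's README: 16-bit PNGs in which 0 marks an invalid pixel and a valid disparity is (raw - 1) / 256 pixels. Treat that encoding as an assumption to verify. The baseline and focal length are placeholder numbers here; real values would come from the camera calibration files shipped with the dataset.

```python
import numpy as np

def disparity_to_depth(raw, baseline_m, focal_px):
    """Convert raw disparity values to metric depth (NaN where invalid).

    Assumption: raw == 0 is invalid, otherwise disparity = (raw - 1) / 256.
    """
    raw = np.asarray(raw, dtype=np.float32)
    disparity = np.where(raw > 0, (raw - 1.0) / 256.0, 0.0)
    depth = np.full(raw.shape, np.nan, dtype=np.float32)
    valid = disparity > 0  # zero disparity would mean infinite depth
    depth[valid] = baseline_m * focal_px / disparity[valid]
    return depth

# Placeholder calibration, NOT the real Cityscapes camera parameters.
demo_raw = np.array([[0, 1, 257, 2561]], dtype=np.uint16)
demo_depth = disparity_to_depth(demo_raw, baseline_m=0.25, focal_px=2000.0)
print(demo_depth)
```

Large disparities map to nearby objects and small ones to distant objects, which is why depth error grows quickly with distance.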

Using Cityscapes with Python

Dataset Structure (Simplified)

cityscapes/
├── leftImg8bit/
│   ├── train/
│   ├── val/
│   └── test/
├── gtFine/
│   ├── train/
│   ├── val/
│   └── test/
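
Because images and annotations live in parallel trees, a training pipeline usually starts by pairing each image with its mask. Here is a minimal sketch, assuming one subfolder per city inside each split (as in the file paths used in this post) and the _leftImg8bit / _gtFine_labelIds naming. The helper names are hypothetical.

```python
from pathlib import Path

def label_path_for(image_path, root, suffix="gtFine_labelIds"):
    """Derive the annotation path that belongs to a leftImg8bit image."""
    image_path = Path(image_path)
    split, city = image_path.parts[-3], image_path.parts[-2]
    name = image_path.name.replace("leftImg8bit", suffix)
    return Path(root) / "gtFine" / split / city / name

def list_pairs(root, split="train"):
    """Return sorted (image_path, label_path) pairs for one split."""
    root = Path(root)
    images = sorted((root / "leftImg8bit" / split).glob("*/*_leftImg8bit.png"))
    return [(img, label_path_for(img, root)) for img in images]

example_image = Path(
    "cityscapes/leftImg8bit/train/cologne/cologne_000000_000019_leftImg8bit.png"
)
example_label = label_path_for(example_image, "cityscapes")
print(example_label)
```

Passing suffix="gtFine_instanceIds" or suffix="gtFine_color" selects the other mask types for the same frame.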

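Visualizing Sample Image + Mask

import matplotlib.pyplot as plt
from PIL import Image

# Paths are relative to the dataset root (the cityscapes/ folder above).
img_path = "leftImg8bit/train/cologne/cologne_000000_000019_leftImg8bit.png"
mask_path = "gtFine/train/cologne/cologne_000000_000019_gtFine_labelIds.png"

img = Image.open(img_path)
mask = Image.open(mask_path)  # single-channel image holding one label ID per pixel

fig, (ax_img, ax_mask) = plt.subplots(1, 2, figsize=(12, 5))

ax_img.imshow(img)
ax_img.set_title("Input Image")
ax_img.axis("off")

# A qualitative colormap keeps neighboring label IDs visually distinct;
# nearest-neighbor interpolation avoids blending IDs at class borders.
ax_mask.imshow(mask, cmap="tab20", interpolation="nearest")
ax_mask.set_title("Segmentation Mask (label IDs)")
ax_mask.axis("off")

plt.tight_layout()
plt.show()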
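
The labelIds mask loaded above covers the full label set, while training on the 19 benchmark classes usually works with contiguous indices 0 to 18 plus an ignore value. The sketch below remaps one to the other with a lookup table. The ID pairs follow the cityscapesScripts label table as I know it, and 255 as the ignore value is the common convention; treat both as assumptions to double-check against the official scripts.

```python
import numpy as np

IGNORE_INDEX = 255

# Assumed labelId -> trainId pairs for the 19 evaluation classes.
LABEL_TO_TRAIN = {
    7: 0, 8: 1, 11: 2, 12: 3, 13: 4, 17: 5, 19: 6, 20: 7, 21: 8, 22: 9,
    23: 10, 24: 11, 25: 12, 26: 13, 27: 14, 28: 15, 31: 16, 32: 17, 33: 18,
}

def label_ids_to_train_ids(label_ids):
    """Remap a labelIds array to train IDs; every other label becomes 255."""
    lut = np.full(256, IGNORE_INDEX, dtype=np.uint8)
    for label_id, train_id in LABEL_TO_TRAIN.items():
        lut[label_id] = train_id
    return lut[np.asarray(label_ids, dtype=np.uint8)]

# Synthetic mask: unlabeled, road, car / bicycle, ego vehicle, person.
demo_labels = np.array([[0, 7, 26], [33, 1, 24]], dtype=np.uint8)
demo_train = label_ids_to_train_ids(demo_labels)
print(demo_train)
```

With a real file, np.array(mask) from the snippet above can be passed in directly.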

Models Trained on Cityscapes

Many state-of-the-art semantic segmentation models are trained or benchmarked on Cityscapes. The figures below are approximate and depend on the backbone, training data, and evaluation split:

| Model | Mean IoU (19 classes) | Notes |
| --- | --- | --- |
| DeepLabv3+ | ~82% | Uses atrous convolutions |
| PSPNet | ~81% | Pyramid Scene Parsing |
| HRNet | ~81%+ | High-resolution network |
| SegFormer | ~82%+ | Transformer-based segmentation |
| Swin Transformer | ~83%+ | Vision Transformer variant |

You can find pre-trained weights for many of these models via PyTorch Hub, MMSegmentation, and Hugging Face.
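
The mean IoU reported in the table is the per-class intersection over union, averaged over the 19 classes. Here is a minimal sketch of that computation from a confusion matrix. It assumes predictions and targets are arrays of class indices with 255 marking ignored pixels, and it skips classes that never occur. The function names are hypothetical, and the official benchmark accumulates the confusion matrix over the whole evaluation set rather than averaging per-image scores.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes=19, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)
    keep = target != ignore_index  # drop pixels without a valid label
    index = target[keep] * num_classes + pred[keep]
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(conf):
    """Average IoU = TP / (TP + FP + FN) over classes that appear at all."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())

# Three-class toy example; the last pixel is ignored.
target = np.array([0, 0, 1, 1, 2, 255])
pred = np.array([0, 1, 1, 1, 2, 0])
conf = confusion_matrix(pred, target, num_classes=3)
miou = mean_iou(conf)
print(conf)
print(round(miou, 4))
```

In a real evaluation loop you would sum the confusion matrices of all images first and call mean_iou once at the end.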


Download and Resources


Summary

| Feature | Value |
| --- | --- |
| Total Images | 25,000 (5,000 fine + 20,000 coarse) |
| Resolution | 2048×1024 |
| Number of Classes | 30 (19 used for evaluation) |
| Key Tasks | Segmentation, Depth, Panoptic, Video |
| Focus | Urban street scenes |
| License | Non-commercial research |

Cityscapes is the go-to dataset for urban scene understanding. Whether you're building an autonomous driving system or training models for street-level scene parsing, Cityscapes offers the rich annotations and real-world diversity needed for high-quality semantic learning.

Pascal VOC Dataset: A Classic in Computer Vision

 

๐Ÿพ Pascal VOC Dataset: A Classic in Computer Vision

The Pascal Visual Object Classes (VOC) dataset is one of the earliest and most influential benchmarks in computer vision, especially for object detection, image classification, segmentation, and person layout tasks. While newer datasets like COCO have taken the spotlight, Pascal VOC remains highly relevant for learning and benchmarking foundational vision models.


What is Pascal VOC?

The Pascal VOC dataset, created as part of the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) project, provides a standardized dataset and evaluation protocol for visual object recognition.

The dataset contains real-life images collected from Flickr and annotated with objects belonging to 20 object categories across various tasks.


Key Features

| Feature | Description |
| --- | --- |
| Years Available | Annual challenges from 2005 to 2012; VOC 2007 and VOC 2012 are the most widely used |
| Total Images | ~11,500 train/val images (VOC 2012) |
| Classes | 20 (e.g., person, dog, cat, car, bicycle) |
| Tasks Supported | Classification, Detection, Segmentation, Person Layout |
| Format | One XML annotation file per image (Pascal VOC format) |

๐Ÿท️ Object Categories

Pascal VOC includes 20 object classes, grouped into categories:

๐Ÿง Person

  • Person

๐Ÿ• Animals

  • Bird, Cat, Cow, Dog, Horse, Sheep

๐Ÿš— Vehicles

  • Aeroplane, Bicycle, Boat, Bus, Car, Motorbike, Train

๐Ÿ›‹️ Indoor Objects

  • Bottle, Chair, Dining table, Potted plant, Sofa, TV/monitor


Supported Tasks

1. Object Classification

Determine whether an object category is present in an image.

2. Object Detection

Detect the presence and location (bounding boxes) of objects in an image.

3. Semantic Segmentation

Pixel-wise labeling of object categories in an image.

4. Person Layout

Locate the parts of a person (head, hands, and feet).


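Data Format: VOC XML

Each image is annotated with an XML file that follows the Pascal VOC annotation format. It records the image size and one object entry per annotated instance, each with a class name and a bounding box in pixel coordinates: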

<annotation>
    <folder>VOC2007</folder>
    <filename>000001.jpg</filename>
    <size>
        <width>353</width>
        <height>500</height>
        <depth>3</depth>
    </size>
    <object>
        <name>dog</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
</annotation>

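This format is still widely used. The TensorFlow Object Detection API ships a converter for it, Albumentations accepts its corner-style boxes through the pascal_voc bounding-box format, and YOLO-style pipelines can consume it once the boxes are converted to normalized center/width/height text labels.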
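
Reading such a file takes only the standard library. The sketch below parses the example annotation shown above and also converts the corner-style box to the normalized center/width/height layout that YOLO-style tools typically expect. The function names and the returned dictionary layout are my own choices for illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<annotation>
    <folder>VOC2007</folder>
    <filename>000001.jpg</filename>
    <size><width>353</width><height>500</height><depth>3</depth></size>
    <object>
        <name>dog</name>
        <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
    </object>
</annotation>
"""

def parse_voc(xml_text):
    """Parse a VOC annotation string into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")  # direct child, so nested part boxes are skipped
        corners = tuple(int(float(box.findtext(k))) for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append({"name": obj.findtext("name"), "bbox": corners})
    return {
        "filename": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "objects": objects,
    }

def to_center_format(bbox, width, height):
    """(xmin, ymin, xmax, ymax) in pixels -> normalized (cx, cy, w, h)."""
    xmin, ymin, xmax, ymax = bbox
    return ((xmin + xmax) / 2 / width, (ymin + ymax) / 2 / height,
            (xmax - xmin) / width, (ymax - ymin) / height)

parsed = parse_voc(SAMPLE_XML)
first = parsed["objects"][0]
center_box = to_center_format(first["bbox"], parsed["width"], parsed["height"])
print(first["name"], first["bbox"], center_box)
```

For files on disk, ET.parse(path).getroot() can replace ET.fromstring.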


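Using VOC for Object Detection

Tip: Use VOCDetection from torchvision

from torchvision.datasets import VOCDetection

dataset = VOCDetection(
    root="path/to/data",  # parent directory; torchvision looks for VOCdevkit/ inside it
    year="2007",
    image_set="train",
    download=True,  # download and extract the archive into root
)

image, target = dataset[0]  # PIL image and a nested dict parsed from the XML file
print(target)  # Annotation in VOC format; objects are under target["annotation"]["object"]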
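
Most detection models want boxes as lists of numbers rather than the nested dictionary that gets printed above. Here is a minimal sketch of that conversion, assuming the target mirrors the XML layout with string values (which is how I understand torchvision parses the file). The helper name and the example dictionary are hypothetical and built by hand so the snippet runs without downloading anything.

```python
def boxes_from_target(target):
    """Extract ([xmin, ymin, xmax, ymax], ...) and class names from a VOC target dict."""
    objects = target["annotation"].get("object", [])
    if isinstance(objects, dict):  # tolerate a single object stored without a list
        objects = [objects]
    boxes, labels = [], []
    for obj in objects:
        box = obj["bndbox"]
        boxes.append([int(box[key]) for key in ("xmin", "ymin", "xmax", "ymax")])
        labels.append(obj["name"])
    return boxes, labels

# Hand-written stand-in for what dataset[0][1] might look like.
example_target = {
    "annotation": {
        "filename": "000001.jpg",
        "size": {"width": "353", "height": "500", "depth": "3"},
        "object": [
            {"name": "dog",
             "bndbox": {"xmin": "48", "ymin": "240", "xmax": "195", "ymax": "371"}},
        ],
    }
}
boxes, labels = boxes_from_target(example_target)
print(boxes, labels)
```

From here, the boxes can be wrapped in a tensor and the class names mapped to integer indices for training.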

Dataset Structure

VOCdevkit/
└── VOC2007/
    ├── JPEGImages/
    ├── Annotations/
    ├── ImageSets/
    └── SegmentationClass/
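
The ImageSets folder holds plain-text split files that list image IDs, which is how the train/val/test partitions are defined. Below is a minimal sketch of turning one of them into image and annotation paths. It assumes the detection splits sit in an ImageSets/Main subfolder with one ID per line; the helper name is hypothetical, and the demo runs on a throwaway directory with made-up IDs.

```python
import tempfile
from pathlib import Path

def read_split(voc_root, split="train"):
    """Return (image_id, image_path, annotation_path) for each ID in a split file."""
    voc_root = Path(voc_root)
    split_file = voc_root / "ImageSets" / "Main" / f"{split}.txt"
    entries = []
    for line in split_file.read_text().splitlines():
        if not line.strip():
            continue
        image_id = line.split()[0]  # per-class split files carry an extra flag column
        entries.append((
            image_id,
            voc_root / "JPEGImages" / f"{image_id}.jpg",
            voc_root / "Annotations" / f"{image_id}.xml",
        ))
    return entries

# Demo on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    main_dir = Path(tmp) / "ImageSets" / "Main"
    main_dir.mkdir(parents=True)
    (main_dir / "train.txt").write_text("000012\n000017\n")
    demo_entries = read_split(tmp)

print([entry[0] for entry in demo_entries])
```

With the real dataset, voc_root would point at the VOC2007 folder shown in the tree.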

Benchmark Results

Pascal VOC was the go-to benchmark before COCO. Many well-known models were initially validated on VOC (figures are approximate and depend on the training data used):

| Model | mAP on VOC 2007 | Notes |
| --- | --- | --- |
| Fast R-CNN | ~70.0% | Introduced ROI pooling |
| Faster R-CNN | ~73.2% | Added Region Proposal Network |
| SSD | ~77.2% | Single-shot detection |
| YOLOv1 | ~63.4% | Fast, real-time performance |
| YOLOv3 | ~80.0% | Later YOLO version with multi-scale predictions |
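
The mAP values above are the mean, over the 20 classes, of a per-class average precision. VOC 2007 used an 11-point interpolated AP, and later editions switched to the area under the precision-recall curve. Here is a minimal sketch of the 11-point variant for a single class. It assumes detections have already been matched to ground truth (conventionally a detection counts as a true positive at IoU of at least 0.5 with an unmatched box); the function name and inputs are hypothetical.

```python
def voc07_average_precision(scores, is_true_positive, num_ground_truth):
    """11-point interpolated AP for one class.

    scores: confidence of each detection
    is_true_positive: whether each detection matched a ground-truth box
    num_ground_truth: number of ground-truth boxes of this class
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # sweep detections from most to least confident
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    ap = 0.0
    for k in range(11):  # recall thresholds 0.0, 0.1, ..., 1.0
        threshold = k / 10
        reachable = [p for r, p in zip(recalls, precisions) if r >= threshold]
        ap += max(reachable, default=0.0) / 11
    return ap

# Three detections for a class with two ground-truth boxes.
ap = voc07_average_precision(
    scores=[0.9, 0.8, 0.7],
    is_true_positive=[True, False, True],
    num_ground_truth=2,
)
print(round(ap, 4))
```

Averaging this value over all classes gives the mAP for a model.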

Labeling Your Own Data in Pascal VOC Format

If you’re creating a custom object detection dataset, many annotation tools, such as CVAT, can export labels in Pascal VOC format. The exported XML files are compatible with TensorFlow and other tools that read VOC annotations.
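
If your labels come from a script rather than an annotation tool, the same XML can be written with the standard library. The sketch below emits only the fields shown in the earlier example; real exports often carry extra tags (such as difficult or truncated flags), which are left out here. The function name, folder value, and file name are placeholders.

```python
import xml.etree.ElementTree as ET

def make_voc_xml(filename, width, height, objects, folder="custom", depth=3):
    """Build a minimal VOC-style annotation string.

    objects: iterable of (class_name, (xmin, ymin, xmax, ymax)) in pixels
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "folder").text = folder
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", depth)):
        ET.SubElement(size, tag).text = str(value)
    for name, (xmin, ymin, xmax, ymax) in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = make_voc_xml("my_image.jpg", 353, 500, [("dog", (48, 240, 195, 371))])
print(xml_text)
```

Writing xml_text to a .xml file next to each image gives a dataset that VOC-aware loaders can read.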


Resources


Summary

| Feature | Value |
| --- | --- |
| Total Images | ~11,500 (VOC 2012 train/val) |
| Classes | 20 |
| Tasks | Detection, Segmentation, Classification, Person Layout |
| Format | Pascal VOC XML |
| Supported Tools | TensorFlow, PyTorch, YOLO, CVAT |

Despite being older, Pascal VOC remains a gold standard for learning object detection. It's smaller and simpler than COCO, making it great for beginners, quick prototyping, or testing custom models.
