Using nuScenes with vision3d#

This example demonstrates using the nuScenes dataset (mini split) with vision3d.datasets.NuScenes3D. It covers inspecting the (SampleInputs, SampleTargets) tuple returned by the dataset, batching with vision3d.datasets.collate_fn() for training, and visualizing frames with vision3d.viz.log_sample().

Construct the dataset#

NuScenes3D yields sample frames describing the 3D scene. Each sample carries the lidar point cloud, all six camera images, the cameras' intrinsics and extrinsics, and 3D bounding-box annotations for the objects in the scene.

from pathlib import Path

from vision3d.datasets import NuScenes3D

NUSCENES_ROOT = Path("~/.cache/vision3d/nuscenes-mini").expanduser()

dataset = NuScenes3D(NUSCENES_ROOT, version="v1.0-mini", split="train", download=True)
print(f"len(dataset) = {len(dataset)}")
print(f"classes ({len(dataset.classes)}): {dataset.classes}")
len(dataset) = 323
classes (10): ['car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier']
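
dataset.class_to_idx (used again for visualization below) maps each class name back to its integer label id. A minimal round-trip sketch, assuming the mapping follows the order of dataset.classes:

# Look up a label id by name, then map it back to the name.
idx = dataset.class_to_idx["pedestrian"]
print(idx, dataset.classes[idx])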

Inspect a sample#

A single index returns an (inputs, targets) tuple, where inputs is a FusionInputs dict and targets is a SampleTargets dict. Most values are vision3d.tensors subclasses, which tag each tensor with its semantic type (PointCloud3D, CameraImages, BoundingBoxes3D, …) so that vision3d.transforms can dispatch the right operation to each input.

inputs, targets = dataset[0]

print("inputs:")
print(
    f"  points: type={type(inputs['points']).__name__} "
    f"shape={tuple(inputs['points'].shape)} dtype={inputs['points'].dtype}"
)
print(
    f"  images: type={type(inputs['images']).__name__} "
    f"shape={tuple(inputs['images'].shape)} dtype={inputs['images'].dtype}"
)
print(
    f"  intrinsics: type={type(inputs['intrinsics']).__name__} "
    f"shape={tuple(inputs['intrinsics'].shape)} dtype={inputs['intrinsics'].dtype}"
)
print(
    f"  extrinsics: type={type(inputs['extrinsics']).__name__} "
    f"shape={tuple(inputs['extrinsics'].shape)} dtype={inputs['extrinsics'].dtype}"
)

print("targets:")
print(
    f"  boxes: type={type(targets['boxes']).__name__} "
    f"shape={tuple(targets['boxes'].shape)} dtype={targets['boxes'].dtype} "
    f"format={targets['boxes'].format.name}"
)
print(
    f"  labels: type={type(targets['labels']).__name__} "
    f"shape={tuple(targets['labels'].shape)} dtype={targets['labels'].dtype}"
)
inputs:
  points: type=PointCloud3D shape=(34688, 5) dtype=torch.float32
  images: type=CameraImages shape=(6, 3, 900, 1600) dtype=torch.float32
  intrinsics: type=CameraIntrinsics shape=(6, 3, 3) dtype=torch.float32
  extrinsics: type=CameraExtrinsics shape=(6, 4, 4) dtype=torch.float32
targets:
  boxes: type=BoundingBoxes3D shape=(68, 7) dtype=torch.float32 format=XYZLWHY
  labels: type=Tensor shape=(68,) dtype=torch.int64
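
Going by the XYZLWHY format name, each box row packs the center position (x, y, z), the box dimensions (length, width, height), and a yaw angle, which is the usual 7-DoF nuScenes parameterization. The integer labels index into dataset.classes, so a sample's annotations decode back to names like this:

# Decode the first five integer labels back to class names.
names = [dataset.classes[int(i)] for i in targets["labels"][:5]]
print(names)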

Batch with vision3d.datasets.collate_fn()#

Variable-size tensors (point clouds, per-frame box counts) cannot be stacked along a batch dimension, so vision3d.datasets.collate_fn() keeps samples separate: it returns a tuple of input dicts and a tuple of target dicts, each keyed exactly like the per-sample dicts. Pass it as the collate_fn argument to DataLoader whenever you train or evaluate on a vision3d dataset.

from torch.utils.data import DataLoader

from vision3d.datasets import collate_fn

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn)
batch_inputs, batch_targets = next(iter(loader))

print(f"batch size: {len(batch_inputs)}")
for i, (inp, tgt) in enumerate(zip(batch_inputs, batch_targets)):
    print(
        f"  sample {i}: "
        f"points={tuple(inp['points'].shape)} "
        f"boxes={tuple(tgt['boxes'].shape)}"
    )
batch size: 2
  sample 0: points=(34688, 5) boxes=(68, 7)
  sample 1: points=(34720, 5) boxes=(77, 7)
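
Because a batch is a tuple of per-sample dicts rather than stacked tensors, a training step iterates over the samples instead of indexing a batch axis. A minimal sketch of the device-transfer pattern; the model call is a hypothetical placeholder:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for batch_inputs, batch_targets in loader:
    # Move every tensor in each per-sample dict onto the target device.
    batch_inputs = tuple({k: v.to(device) for k, v in inp.items()} for inp in batch_inputs)
    batch_targets = tuple({k: v.to(device) for k, v in tgt.items()} for tgt in batch_targets)
    # losses = model(batch_inputs, batch_targets)  # hypothetical model API
    break  # one step is enough for demonstration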

Visualize the dataset#

vision3d.viz.log_sample() logs a SampleInputs / SampleTargets pair to Rerun for interactive visualization; fusion_layout() builds a matching blueprint from the dataset's camera names and grid layout. The loop below logs the first ten frames on a "frame" timeline.

import rerun as rr
import rerun.blueprint as rrb

from vision3d.viz import fusion_layout, log_sample

rr.init("vision3d_nuscenes", spawn=True)
rr.send_blueprint(
    rrb.Blueprint(
        fusion_layout(NuScenes3D.camera_names, NuScenes3D.camera_grid),
        rrb.TimePanel(state="collapsed"),
    )
)
rr.log("world", rr.ViewCoordinates.RIGHT_HAND_Z_UP, static=True)
rr.log(
    "world/boxes",
    rr.AnnotationContext([(i, name) for name, i in dataset.class_to_idx.items()]),
    static=True,
)

for frame_idx in range(10):
    f_inputs, f_targets = dataset[frame_idx]
    rr.set_time("frame", sequence=frame_idx)
    log_sample(f_inputs, f_targets, label_to_id=dataset.class_to_idx, jpeg_quality=75)
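
On a headless machine (CI, a remote server) you can write the same stream to a .rrd file instead of spawning the viewer, then open it later with the rerun CLI. rr.save() is part of the Rerun SDK:

rr.init("vision3d_nuscenes")  # no spawn=True when there is no display
rr.save("nuscenes_mini.rrd")  # subsequent rr.log calls are written to this file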

Total running time of the script: (0 minutes 2.411 seconds)