Using nuScenes with vision3d#

This example demonstrates using the nuScenes dataset (mini-split) with vision3d.datasets.NuScenes3D. It covers inspecting the SampleInputs, SampleTargets tuple returned by the dataset, batching with vision3d.datasets.collate_fn() for training, and visualizing a frame with vision3d.viz.log_sample().

Construct the dataset#

NuScenes3D yields sample frames describing the 3D scene. Each sample carries lidar points, all six camera images, their intrinsics and extrinsics, and 3D bounding-box annotations of the objects in the scene.

from pathlib import Path

from vision3d.datasets import NuScenes3D

NUSCENES_ROOT = Path("~/.cache/vision3d/nuscenes-mini").expanduser()

dataset = NuScenes3D(NUSCENES_ROOT, version="v1.0-mini", split="train", download=True)
print(f"len(dataset) = {len(dataset)}")
print(f"classes ({len(dataset.classes)}): {dataset.classes}")

len(dataset) = 323
classes (10): ('car', 'truck', 'bus', 'trailer', 'construction_vehicle', 'pedestrian', 'motorcycle', 'bicycle', 'traffic_cone', 'barrier')

Inspect a sample#

A single index returns a (inputs, targets) tuple where inputs is a FusionInputs dict and targets is a SampleTargets dict. Most values are semantic tensor types from vision3d.tensors (PointCloud3D, CameraImages, BoundingBoxes3D, …) so vision3d.transforms can dispatch to the right operation per input.

inputs, targets = dataset[0]

print("inputs:")
print(
    f"  points: type={type(inputs['points']).__name__} "
    f"shape={tuple(inputs['points'].shape)} dtype={inputs['points'].dtype}"
)
print(
    f"  images: type={type(inputs['images']).__name__} "
    f"shape={tuple(inputs['images'].shape)} dtype={inputs['images'].dtype}"
)
print(
    f"  intrinsics: type={type(inputs['intrinsics']).__name__} "
    f"shape={tuple(inputs['intrinsics'].shape)} dtype={inputs['intrinsics'].dtype}"
)
print(
    f"  extrinsics: type={type(inputs['extrinsics']).__name__} "
    f"shape={tuple(inputs['extrinsics'].shape)} dtype={inputs['extrinsics'].dtype}"
)

print("targets:")
print(
    f"  boxes: type={type(targets['boxes']).__name__} "
    f"shape={tuple(targets['boxes'].shape)} dtype={targets['boxes'].dtype} "
    f"format={targets['boxes'].format.name}"
)
print(
    f"  labels: type={type(targets['labels']).__name__} "
    f"shape={tuple(targets['labels'].shape)} dtype={targets['labels'].dtype}"
)

inputs:
  points: type=PointCloud3D shape=(34688, 5) dtype=torch.float32
  images: type=CameraImages shape=(6, 3, 900, 1600) dtype=torch.float32
  intrinsics: type=CameraIntrinsics shape=(6, 3, 3) dtype=torch.float32
  extrinsics: type=CameraExtrinsics shape=(6, 4, 4) dtype=torch.float32
targets:
  boxes: type=BoundingBoxes3D shape=(68, 7) dtype=torch.float32 format=XYZLWHY
  labels: type=Tensor shape=(68,) dtype=torch.int64

Densify with multiple lidar sweeps#

A single lidar key-frame is sparse, which limits 3D detection accuracy. Multi-sweep accumulation increases point density by combining the sweeps captured between key-frames into the current frame. Each sweep is motion-compensated into the key-frame lidar frame, and every point is annotated with its time offset relative to the key-frame. The denser point cloud retains temporal information and improves model performance.

The technique was popularized by the nuScenes dataset, which accumulates 10 sweeps, roughly 0.5 seconds at the 2 Hz key-frame rate.

In vision3d, set num_sweeps on NuScenes3D. Each point gains a trailing time-offset column, so the point cloud grows from [N, 5] to [N, 6] channels (x, y, z, intensity, ring column, time).

import rerun as rr
import rerun.blueprint as rrb

from vision3d.viz import fusion_layout, log_sample

dense = NuScenes3D(NUSCENES_ROOT, version="v1.0-mini", split="train", num_sweeps=10)

# A mid-scene frame. The first frames of a scene have fewer prior sweeps to
# fold in, so aggregation falls back to whatever is available.
frame = 10
single_points = dataset[frame][0]["points"]
dense_points = dense[frame][0]["points"]

print(f"single sweep: {tuple(single_points.shape)} (x,y,z,intensity,ring)")
print(f"10 sweeps:    {tuple(dense_points.shape)} (x,y,z,intensity,ring,time)")
time = dense_points[:, -1]
print(f"time column:  {time.min():.3f} to {time.max():.3f} s before key-frame")

dense_inputs, dense_targets = dense[frame]

rr.init("vision3d_nuscenes_sweeps", spawn=True)
rr.send_blueprint(
    rrb.Blueprint(
        fusion_layout(NuScenes3D.camera_names, NuScenes3D.camera_grid),
        rrb.TimePanel(state="collapsed"),
    )
)
rr.log("world", rr.ViewCoordinates.RIGHT_HAND_Z_UP, static=True)
log_sample(dense_inputs, dense_targets, label_to_id=dense.class_to_idx, jpeg_quality=75)

single sweep: (34688, 5) (x,y,z,intensity,ring)
10 sweeps:    (346880, 6) (x,y,z,intensity,ring,time)
time column:  0.000 to 0.450 s before key-frame

Batch with `vision3d.datasets.collate_fn()`#

Variable-size tensors (point clouds, per-frame box counts) cannot be stacked along a batch dimension, so vision3d.datasets.collate_fn() returns tuples-of-tensors keyed the same as the per-sample dicts. Pass it as the collate_fn argument to DataLoader whenever you train or evaluate on a vision3d dataset.

from torch.utils.data import DataLoader

from vision3d.datasets import collate_fn

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn)
batch_inputs, batch_targets = next(iter(loader))

print(f"batch size: {len(batch_inputs)}")
for i, (inp, tgt) in enumerate(zip(batch_inputs, batch_targets)):
    print(
        f"  sample {i}: "
        f"points={tuple(inp['points'].shape)} "
        f"boxes={tuple(tgt['boxes'].shape)}"
    )

batch size: 2
  sample 0: points=(34688, 5) boxes=(68, 7)
  sample 1: points=(34720, 5) boxes=(77, 7)

Visualize the dataset#

vision3d.viz.log_sample() logs a SampleInputs / SampleTargets pair to Rerun for interactive visualization. See the visualization examples for overlaying detector predictions on the ground truth.

import rerun as rr
import rerun.blueprint as rrb

from vision3d.viz import fusion_layout, log_sample

rr.init("vision3d_nuscenes", spawn=True)
rr.send_blueprint(
    rrb.Blueprint(
        fusion_layout(NuScenes3D.camera_names, NuScenes3D.camera_grid),
        rrb.TimePanel(state="collapsed"),
    )
)
rr.log("world", rr.ViewCoordinates.RIGHT_HAND_Z_UP, static=True)

for frame_idx in range(10):
    f_inputs, f_targets = dataset[frame_idx]
    rr.set_time("frame", sequence=frame_idx)
    log_sample(f_inputs, f_targets, label_to_id=dataset.class_to_idx, jpeg_quality=75)

Total running time of the script: (0 minutes 2.703 seconds)