ICRA 2026

Self-Supervised Point Cloud Single Object Tracking

Yuheng Liu, Le Hui, Ziyue Zhu, Shaohui Mei, Yigong Zhang, Jin Xie, Jian Yang

AI summary

The first fully self-supervised framework for 3D point cloud single object tracking performs on par with fully supervised trackers, and in some cases outperforms them, by combining motion, geometry, and semantic cues without manual labels.

Self-supervised learning, Point cloud tracking, Single object tracking, Scene flow, Autonomous driving, Pseudo-labeling

Problem

Current 3D single object tracking relies on expensive frame-by-frame manual annotations, hindering scalability to large unlabeled LiDAR datasets. Adapting self-supervised methods to sparse 3D data is challenging due to unreliable appearance matching and lack of intermediate supervision.

Approach

The method generates pseudo labels by clustering local scene flow, pre-trains a proposal network via point cloud forecasting to capture global motion and geometry, iteratively refines labels with semantic prototypes, and tracks targets using a predictive motion filter.
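The first step, clustering local scene flow into pseudo labels, can be illustrated with a minimal sketch. It assumes per-point flow vectors are already available and groups moving points whose positions and flow vectors are both similar. The function names, the region-growing strategy, and all thresholds are hypothetical choices for illustration; the summary does not specify the clustering algorithm the authors use.

```python
from collections import deque
from math import dist


def cluster_moving_points(points, flows, min_speed=0.5, radius=1.0,
                          flow_tol=0.5, min_size=3):
    """Group moving points that are close in both position and flow.

    points, flows: equal-length lists of (x, y, z) tuples.
    Returns a list of clusters, each a sorted list of point indices.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    origin = (0.0, 0.0, 0.0)
    # Keep only points whose flow magnitude suggests real motion.
    unvisited = {i for i, f in enumerate(flows) if dist(f, origin) >= min_speed}
    clusters = []
    while unvisited:
        seed = min(unvisited)  # deterministic seed order
        unvisited.remove(seed)
        queue = deque([seed])
        members = [seed]
        # Region growing: absorb neighbours with a similar flow vector.
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited
                    if dist(points[i], points[j]) <= radius
                    and dist(flows[i], flows[j]) <= flow_tol]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        if len(members) >= min_size:
            clusters.append(sorted(members))
    return clusters


def pseudo_box(points, cluster):
    """Axis-aligned box (min corner, max corner) around one cluster."""
    xs, ys, zs = zip(*(points[i] for i in cluster))
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))
```

Each returned cluster can be turned into a coarse pseudo label with `pseudo_box`. A practical system would likely fit oriented boxes and use a spatial index instead of the quadratic neighbour search shown here.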

Key results

First fully self-supervised point cloud single object tracking framework
Performs on par with, and in some cases outperforms, fully supervised trackers on KITTI, nuScenes, and Waymo
Introduces iterative semantic prototype refinement for accurate pseudo-label generation
Demonstrates scalable label-free tracking using motion, geometry, and semantic cues
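The iterative prototype refinement named in the results above can be sketched as follows. This is a simplified illustration under stated assumptions: each proposal has a fixed feature vector, the prototype is the mean feature of the currently accepted proposals, and a proposal is accepted when its cosine similarity to the prototype passes a threshold. The paper retrains the proposal network in each round, which this sketch does not model; `refine_pseudo_labels`, the threshold, and the round count are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refine_pseudo_labels(features, initial_labels, threshold=0.8, rounds=3):
    """Re-select proposals by similarity to an evolving prototype.

    features: one feature vector per proposal.
    initial_labels: indices initially treated as objects.
    Returns the set of indices accepted after refinement.
    """
    labels = set(initial_labels)
    for _ in range(rounds):
        if not labels:
            break
        # The prototype evolves with the current label set.
        prototype = mean_vector([features[i] for i in sorted(labels)])
        new_labels = {i for i, f in enumerate(features)
                      if cosine(f, prototype) >= threshold}
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels
```

In this toy form, a proposal missed by the initial pseudo labels is recovered once it resembles the prototype, and dissimilar proposals are dropped.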

Why it matters

Eliminates the need for costly manual annotations, enabling scalable 3D object tracking for autonomous driving perception.

Abstract

Point cloud single object tracking is critical in autonomous driving. However, current methods heavily rely on frame-by-frame human annotations, which do not scale well with the growing amount of unlabeled LiDAR data. In this paper, we propose the first self-supervised point cloud single object tracking framework, eliminating the need for any manual labels. Our method integrates motion, geometry, and semantic cues to generate plausible object proposals and tracks the target using a predictive filter. Specifically, we generate pseudo labels by clustering local motion patterns from scene flow, while pre-training a proposal network using point cloud forecasting as a proxy task to learn global motion patterns and geometric shape priors. Then, we train the proposal network using the initial pseudo labels and iteratively refine them by treating semantic features as evolving prototypes in each training round. Finally, a simple motion filter is employed to predict the target’s current state based on its past dynamics. Evaluated on KITTI, nuScenes, and Waymo, our self-supervised point cloud single object tracking approach is on par with, and in some cases outperforms, fully supervised trackers, demonstrating that self-supervision is a scalable path forward for 3D single object tracking.
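The final tracking step, predicting the target's current state from its past dynamics, might look like the sketch below. It assumes a constant-velocity model over the last few box centers and picks the proposal closest to the prediction. The abstract only says the filter is "simple", so the constant-velocity model, the function names, and the `max_frames` window are assumptions made for illustration.

```python
from math import dist


def predict_next_center(history, max_frames=3):
    """Extrapolate the next box center with a constant-velocity model.

    history: past centers as (x, y, z) tuples, oldest first.
    max_frames: how many recent frames to average the velocity over
    (an illustrative choice, not a value from the paper).
    """
    if not history:
        raise ValueError("history must contain at least one center")
    if len(history) == 1:
        return history[-1]  # no motion estimate yet
    recent = history[-max_frames:]
    steps = len(recent) - 1
    # Average per-frame displacement across the window.
    velocity = tuple((end - start) / steps
                     for start, end in zip(recent[0], recent[-1]))
    return tuple(c + v for c, v in zip(recent[-1], velocity))


def select_proposal(predicted, proposals):
    """Pick the proposal center nearest to the predicted center."""
    return min(proposals, key=lambda center: dist(center, predicted))
```

For example, with a target that moved one metre per frame along x, the prediction continues that motion and the nearest proposal is selected:

```python
history = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
predicted = predict_next_center(history)  # (3.0, 0.0, 0.0)
target = select_proposal(predicted, [(10.0, 0.0, 0.0), (3.2, 0.1, 0.0)])
```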

Index terms

Visual Tracking, Deep Learning Methods