Self-Supervised Point Cloud Single Object Tracking
Yuheng Liu, Le Hui, Ziyue Zhu, Shaohui Mei, Yigong Zhang, Jin Xie, Jian Yang
AI summary
Problem
Current 3D single object tracking relies on expensive frame-by-frame manual annotations, hindering scalability to large unlabeled LiDAR datasets. Adapting self-supervised methods to sparse 3D data is challenging due to unreliable appearance matching and lack of intermediate supervision.
Approach
The method generates pseudo labels by clustering local scene flow, pre-trains a proposal network via point cloud forecasting to capture global motion and geometry, iteratively refines labels with semantic prototypes, and tracks targets using a predictive motion filter.
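As a rough illustration of the pseudo-label step, the sketch below groups moving points whose positions and scene-flow vectors are both close, then wraps each group in an axis-aligned box. The summary does not name the clustering algorithm or box-fitting procedure, so the region-growing scheme, the function names `cluster_by_flow` and `pseudo_box`, and all thresholds are assumptions made for illustration only.

```python
import math
from collections import deque


def cluster_by_flow(points, flows, min_speed=0.2, max_dist=1.0,
                    max_flow_diff=0.3, min_points=3):
    """Group moving points that are close in both position and flow.

    Hypothetical helper: thresholds are illustrative, not from the paper.
    Returns a list of clusters, each a sorted list of point indices.
    """
    # Discard (near-)static points; only moving points can seed a proposal.
    moving = [i for i, f in enumerate(flows) if math.hypot(*f) >= min_speed]
    unvisited = set(moving)
    clusters = []
    for seed in moving:
        if seed not in unvisited:
            continue
        unvisited.remove(seed)
        queue, members = deque([seed]), [seed]
        # Breadth-first region growing over spatial and flow similarity.
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= max_dist
                    and math.dist(flows[i], flows[j]) <= max_flow_diff]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                members.append(j)
        if len(members) >= min_points:
            clusters.append(sorted(members))
    return clusters


def pseudo_box(points, members):
    """Axis-aligned box (min corner, max corner) around one cluster."""
    coords = [points[i] for i in members]
    lo = tuple(min(c[k] for c in coords) for k in range(3))
    hi = tuple(max(c[k] for c in coords) for k in range(3))
    return lo, hi


# Toy scene: two objects moving in different directions plus one static point.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
          (10.0, 0.0, 0.0), (10.5, 0.0, 0.0), (11.0, 0.0, 0.0),
          (5.0, 5.0, 0.0)]
flows = [(1.0, 0.0, 0.0)] * 3 + [(0.0, -1.0, 0.0)] * 3 + [(0.0, 0.0, 0.0)]

clusters = cluster_by_flow(points, flows)
boxes = [pseudo_box(points, c) for c in clusters]
```

In this toy scene the static point is filtered out and the two moving groups become separate box proposals. A real pipeline would estimate the flow from consecutive LiDAR sweeps and fit oriented boxes rather than axis-aligned ones.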
Key results
- First fully self-supervised point cloud single object tracking framework
- Performs on par with, and in some cases outperforms, fully supervised trackers on KITTI, nuScenes, and Waymo
- Introduces iterative semantic prototype refinement for accurate pseudo-label generation
- Demonstrates scalable label-free tracking using motion, geometry, and semantic cues
Why it matters
Eliminates the need for costly manual annotations, enabling scalable 3D object tracking for autonomous driving perception.
Abstract
Point cloud single object tracking is critical in autonomous driving. However, current methods heavily rely on frame-by-frame human annotations, which do not scale well with the growing amount of unlabeled LiDAR data. In this paper, we propose the first self-supervised point cloud single object tracking framework, eliminating the need for any manual labels. Our method integrates motion, geometry, and semantic cues to generate plausible object proposals and tracks the target using a predictive filter. Specifically, we generate pseudo labels by clustering local motion patterns from scene flow, while pre-training a proposal network using point cloud forecasting as a proxy task to learn global motion patterns and geometric shape priors. Then, we train the proposal network using the initial pseudo labels and iteratively refine them by treating semantic features as evolving prototypes in each training round. Finally, a simple motion filter is employed to predict the target’s current state based on its past dynamics. Evaluated on KITTI, nuScenes, and Waymo, our self-supervised point cloud single object tracking approach is on par with, and in some cases outperforms, fully supervised trackers, demonstrating that self-supervision is a scalable path forward for 3D single object tracking.
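To make the final tracking step concrete, here is a minimal sketch of one possible motion filter: a constant-velocity extrapolation of the target's box center, followed by picking the proposal nearest to that prediction. The abstract only says the filter is simple and uses past dynamics, so the constant-velocity model, the nearest-center matching rule, and the names `predict_center` and `select_proposal` are assumptions, not the authors' implementation.

```python
import math


def predict_center(history):
    """Extrapolate the next box center from the last two observed centers.

    Assumes constant velocity between frames (an illustrative choice).
    """
    if not history:
        raise ValueError("history must contain at least one center")
    if len(history) == 1:
        # No velocity estimate yet: assume the target stays in place.
        return tuple(history[-1])
    prev, last = history[-2], history[-1]
    return tuple(l + (l - p) for p, l in zip(prev, last))


def select_proposal(history, proposals):
    """Return the proposal center closest to the predicted center."""
    predicted = predict_center(history)
    return min(proposals, key=lambda c: math.dist(c, predicted))


# Target moved from the origin to (1.0, 0.5, 0.0) over one frame.
history = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)]
proposals = [(1.0, 0.5, 0.0), (2.1, 0.9, 0.0), (5.0, 5.0, 0.0)]

predicted = predict_center(history)          # (2.0, 1.0, 0.0)
chosen = select_proposal(history, proposals)  # (2.1, 0.9, 0.0)
```

Matching on motion rather than appearance suits sparse LiDAR data, where the same object can look very different from one sweep to the next.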