R3DPA: Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation
Nicolas Sereyjol-Garros, Ellington Kirby, Victor Besnier, Nermin Samet
AI summary
Problem
Collecting large-scale, annotated LiDAR datasets is expensive, so the available datasets remain limited and hinder scalable autonomous driving development. Meanwhile, existing generative models do not leverage powerful RGB priors or self-supervised 3D features.
Approach
R3DPA aligns the internal representations of a flow-matching generative model with self-supervised 3D features and initializes the model with RGB image-pretrained weights, using a two-stage process of VAE alignment followed by end-to-end training.
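The two training signals described above can be sketched as a single objective. This is a minimal sketch, not the R3DPA implementation: it assumes a rectified-flow path x_t = (1 - t) * x0 + t * noise (so the target velocity is noise - x0) and a REPA-style alignment term, namely the negative cosine similarity between projected intermediate generator features and frozen self-supervised 3D features at the same locations. The linear projection `proj` and the weight `lam` are hypothetical.

```python
import numpy as np

def flow_matching_loss(v_pred, x0, noise):
    # Assumed path: x_t = (1 - t) * x0 + t * noise, so dx_t/dt = noise - x0.
    target = noise - x0
    return float(np.mean((v_pred - target) ** 2))

def alignment_loss(hidden, ssl_feats, proj):
    # hidden:    (N, D_model) intermediate tokens of the generator
    # ssl_feats: (N, D_ssl) frozen self-supervised 3D features for the same N locations
    # proj:      (D_model, D_ssl) hypothetical linear projection head
    h = hidden @ proj
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    y = ssl_feats / np.linalg.norm(ssl_feats, axis=1, keepdims=True)
    # Negative mean cosine similarity: lowest (-1) when all directions agree.
    return float(-np.mean(np.sum(h * y, axis=1)))

def total_loss(v_pred, x0, noise, hidden, ssl_feats, proj, lam=0.5):
    # lam is a hypothetical weighting, not a value taken from the paper.
    return flow_matching_loss(v_pred, x0, noise) + lam * alignment_loss(hidden, ssl_feats, proj)
```

The alignment term only shapes the generator's hidden features; at sampling time the frozen 3D encoder and the projection head are not needed.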
Key results
- First method to transfer RGB image-pretrained flow matching weights to LiDAR generation.
- Achieves state-of-the-art performance on KITTI-360, surpassing previous methods by at least 17%.
- Enables controllable scene editing (object inpainting and scene mixing) at inference using an unconditional model.
- End-to-end training with 3D alignment creates a more expressive latent space and significantly improves generation quality.
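Inpainting with a purely unconditional model is possible because the known region can be re-imposed during sampling. The sketch below shows one common replacement scheme adapted to flow matching; it is an illustration under stated assumptions, not necessarily the procedure R3DPA uses. It assumes the same path x_t = (1 - t) * x0 + t * noise, explicit Euler integration from t = 1 (noise) to t = 0 (data), and a scene stored as a 2D array (for example a range image, which is itself an assumption here). `velocity_fn` stands in for a trained unconditional model and is hypothetical.

```python
import numpy as np

def inpaint(velocity_fn, x_known, mask, steps=50, seed=0):
    """Fill the region where mask == 0 using an unconditional flow model.

    mask == 1 marks entries whose values are kept from x_known.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=x_known.shape)
    x = noise.copy()
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t)
        x = x + (t_next - t) * v  # t_next < t, so this moves toward data
        # Overwrite the known region with the known data noised to level t_next.
        known_t = (1 - t_next) * x_known + t_next * noise
        x = mask * known_t + (1 - mask) * x
    return x
```

Reusing the same `noise` for the known region keeps it consistent with the path the sampler follows; at the final step t_next = 0, so the known entries equal x_known exactly, while the model fills the rest.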
Why it matters
It bridges the data scarcity gap for 3D LiDAR datasets, enabling more realistic synthetic data generation for training and testing autonomous driving systems.
Abstract
LiDAR scene synthesis is an emerging solution to the scarcity of 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and we leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating the limited size of LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing using only an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.