ICRA 2026

R3DPA: Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation

Nicolas Sereyjol-Garros, Ellington Kirby, Victor Besnier, Nermin Samet

AI summary

R3DPA unlocks image-pretrained priors and self-supervised 3D features to generate high-fidelity LiDAR scenes, achieving state-of-the-art results on KITTI-360.

LiDAR generation, flow matching, representation alignment, RGB priors, 3D self-supervised learning, autonomous driving

Problem

Large-scale annotated LiDAR datasets are expensive to collect and remain scarce, which hinders scalable autonomous driving development. Existing LiDAR generative models do not exploit powerful RGB priors or self-supervised 3D features.

Approach

R3DPA aligns a flow-matching generative model's internal representations with self-supervised 3D features and initializes it with RGB image-pretrained weights through a two-stage VAE alignment and end-to-end training process.
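The training objective described above can be illustrated with a minimal sketch. It assumes a straight-line noise-to-data path (x_t = (1 - t) x0 + t x1, target velocity x1 - x0) and a cosine-similarity alignment term between projected intermediate features and frozen self-supervised 3D features. The function names, the weight `lam`, and the exact form of both terms are illustrative assumptions, not taken from the paper or its code.

```python
import math


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(pred_velocity, x0, x1):
    """Mean squared error against the straight-line target velocity x1 - x0."""
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def alignment_loss(projected_feats, target_feats):
    """One minus the mean per-token cosine similarity (near 0 when aligned)."""
    sims = [cosine_similarity(h, f) for h, f in zip(projected_feats, target_feats)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(pred_velocity, x0, x1, projected_feats, target_feats, lam=0.5):
    """Generative loss plus a weighted alignment term (lam is hypothetical)."""
    fm = flow_matching_loss(pred_velocity, x0, x1)
    return fm + lam * alignment_loss(projected_feats, target_feats)
```

In a real model, `pred_velocity` and `projected_feats` would both come from a single forward pass on the interpolated latent, and `target_feats` from a frozen 3D encoder applied to the clean scene.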

Key results

First method to transfer RGB image-pretrained flow matching weights to LiDAR generation.
Achieves state-of-the-art performance on KITTI-360, surpassing previous methods by at least 17%.
Enables controllable scene editing (object inpainting and scene mixing) at inference using an unconditional model.
End-to-end training with 3D alignment creates a more expressive latent space and significantly improves generation quality.
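One common way to obtain inpainting from an unconditional flow model is to overwrite the region to be kept at every integration step. The sketch below shows that generic replacement-style technique with Euler integration on a flat list of values. It is not the paper's procedure, which may differ, and `inpaint`, `velocity_fn`, and the mask convention are hypothetical names chosen for illustration.

```python
import random


def inpaint(velocity_fn, known, mask, steps=50, seed=0):
    """Generate values where mask == 0 while keeping known values where mask == 1.

    velocity_fn(x, t) stands in for an unconditional flow model (assumption).
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in known]
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Put the kept region on the straight noise-to-data path at time t,
        # so the model sees context consistent with the current noise level.
        x = [m * ((1.0 - t) * n + t * d) + (1 - m) * xi
             for m, n, d, xi in zip(mask, noise, known, x)]
        v = velocity_fn(x, t)
        # Explicit Euler step along the predicted velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    # Paste the known values exactly into the final sample.
    return [m * d + (1 - m) * xi for m, d, xi in zip(mask, known, x)]


def toy_velocity(target):
    """Exact velocity field for a data distribution concentrated on one point."""
    def fn(x, t):
        return [(c - xi) / (1.0 - t) for c, xi in zip(target, x)]
    return fn


# Keep the first value, generate the other two with the toy model.
result = inpaint(toy_velocity([2.0, -1.0, 0.5]), known=[9.0, 0.0, 0.0], mask=[1, 0, 0])
```

With a trained model, `known` would hold the latent or range-image values of the scene to preserve and `mask` would mark the region to regenerate.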

Why it matters

It bridges the data scarcity gap for 3D LiDAR datasets, enabling more realistic synthetic data generation for training and testing autonomous driving systems.

Abstract

LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing using only an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.

Index terms

Deep Learning for Visual Perception, Representation Learning, Range Sensing