ICRA 2026

Veila: Panoramic LiDAR Generation from a Monocular RGB Image

Youquan Liu, Lingdong Kong, Weidong Yang, Ao Liang, Jianxiong Gao, Yang Wu, Xiang Xu, Xin Li, Linfeng Li, Runnan Chen, Ben Fei

AI summary

Veila enables high-fidelity, controllable panoramic LiDAR generation from a single monocular RGB image, achieving state-of-the-art cross-modal consistency and boosting downstream 3D perception tasks.

Panoramic LiDAR Generation; Monocular RGB Conditioning; Diffusion Models; Cross-Modal Alignment; 3D Perception; Data Augmentation

Problem

Existing LiDAR generation methods lack fine-grained spatial control, while generating panoramic LiDAR from a single monocular RGB image remains unexplored due to challenges in reliable conditioning, noisy cross-modal alignment, and maintaining global structural coherence.

Approach

Veila is a conditional diffusion framework that adaptively fuses semantic and depth cues from a single RGB image, enforces robust cross-modal alignment during denoising, and applies global self-attention to maintain structural consistency across the entire panoramic field of view.
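As a rough illustration of what "adaptively fuses semantic and depth cues" could look like at a single spatial location, the sketch below weights two feature vectors by a softmax over per-cue confidence scores. This is not the paper's implementation: the function name `fuse_cues`, the softmax weighting, and the `temperature` parameter are all assumptions made for the example, since the summary does not specify how reliability is estimated or applied.

```python
import math


def fuse_cues(sem_feat, depth_feat, sem_conf, depth_conf, temperature=1.0):
    """Blend a semantic and a depth feature vector at one location.

    Hypothetical sketch: the weights come from a softmax over the two
    confidence scores, so the more reliable cue dominates the fused feature.
    """
    if len(sem_feat) != len(depth_feat):
        raise ValueError("feature vectors must have the same length")

    # Numerically stable two-way softmax over the confidence scores.
    m = max(sem_conf, depth_conf)
    e_sem = math.exp((sem_conf - m) / temperature)
    e_depth = math.exp((depth_conf - m) / temperature)
    w_sem = e_sem / (e_sem + e_depth)
    w_depth = 1.0 - w_sem

    # Convex combination of the two cues.
    fused = [w_sem * s + w_depth * d for s, d in zip(sem_feat, depth_feat)]
    return fused, (w_sem, w_depth)
```

With equal confidences the two cues are averaged; as one confidence grows relative to the other, the fused feature moves toward that cue. In a real model the confidences would presumably be predicted per pixel rather than passed in by hand.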

Key results

State-of-the-art generation fidelity and cross-modal consistency on nuScenes and SemanticKITTI
Novel Cross-Modal Semantic and Depth Consistency metrics for evaluating RGB-LiDAR alignment
Introduction of the KITTI-Weather benchmark for adverse-weather LiDAR generation
Significant improvement in downstream LiDAR semantic segmentation via generative data augmentation

Why it matters

It provides a scalable, low-cost solution for synthesizing diverse 3D LiDAR data, significantly benefiting autonomous driving and robotics research where real-world sensor data is expensive or scarce.

Abstract

Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text-guided synthesis, which lacks fine-grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative, which remains an open problem. However, it faces three core challenges: (i) semantic and depth cues from RGB vary spatially, complicating reliable conditioning generation; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in image-LiDAR’s non-overlap regions. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: (i) a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; (ii) Geometric Cross-Modal Alignment (GCMA) for robust RGB-LiDAR alignment under noisy diffusion; and (iii) Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics – Cross-Modal Semantic Consistency and Cross-Modal Depth Consistency – to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI-Weather benchmark demonstrate that Veila achieves state-of-the-art generation fidelity and cross-modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.
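To make the "non-overlap regions" in challenge (iii) concrete, the sketch below marks which columns of a 360-degree panoramic LiDAR range image fall inside a camera's horizontal field of view. Everything here is an assumption for illustration: the abstract does not state that Veila uses a range-image representation, and the function name `camera_overlap_columns`, the column-to-azimuth convention, and the 90-degree field of view are hypothetical.

```python
def camera_overlap_columns(width, hfov_deg, cam_yaw_deg=0.0):
    """Return one boolean per range-image column: True if the camera sees it.

    Assumed convention: columns span azimuths from -180 to 180 degrees,
    and the camera looks along azimuth `cam_yaw_deg`.
    """
    half_fov = hfov_deg / 2.0
    mask = []
    for col in range(width):
        # Azimuth of the column centre, in degrees.
        azimuth = (col + 0.5) / width * 360.0 - 180.0
        # Signed angular difference to the camera axis, wrapped to [-180, 180).
        diff = (azimuth - cam_yaw_deg + 180.0) % 360.0 - 180.0
        mask.append(abs(diff) <= half_fov)
    return mask


# Illustrative values only: a 360-column panorama and a 90-degree camera.
mask = camera_overlap_columns(width=360, hfov_deg=90.0)
overlap_fraction = sum(mask) / len(mask)
```

Under these assumed values only a quarter of the panorama is directly observed by the image; the remaining columns have no pixel-level RGB evidence, which is the region where the abstract says structural coherence is hardest to maintain.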

Index terms

Computer Vision for Automation; Deep Learning for Visual Perception; Sensor Fusion