IROS 2024

NeuralFloors++: Consistent Street-Level Scene Generation from BEV Semantic Maps

Valentina Musat, Daniele De Martini, Matthew Gadd, Paul Newman

Abstract

Learning autonomous driving capabilities requires diverse and realistic training data. This has led to exploring generative techniques as an alternative to real-world data collection. In this paper we propose a method for synthesising photo-realistic urban driving scenes, along with semantic, instance and depth ground truth. Our model relies on Bird’s Eye View (BEV) representations due to their compositionality and scene content control capabilities, reducing the need for traditional simulators. We employ a two-stage process: first, a 3D scene representation is extracted from BEV semantic, instance and style maps using a neural field. After rendering the semantic, instance, depth and style maps from a ground-view perspective, a second stage based on a diffusion model is used to generate the photo-realistic scene. We extend our prior work, NeuralFloors, to include multiple-view outputs, style manipulation for finer object-level control through instance-wise style maps, and cross-frame consistency via auto-regressive training. The proposed system is evaluated extensively on the KITTI-360 dataset, showing improved realism and semantic alignment for generated images.
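To make the first stage more concrete, the sketch below shows one way a BEV semantic grid can be turned into ground-view semantic and depth values. It is a toy geometric stand-in, not the neural field described in the abstract: it simply marches rays through a 2D label grid and records the first occupied cell per image column. The function name `render_ground_view`, its parameters, and the convention that label 0 means free space are all assumptions made for illustration.

```python
import numpy as np

def render_ground_view(bev, cam_rc, heading, fov=np.pi / 2,
                       n_cols=8, max_range=20.0, step=0.25):
    """Toy illustration only: march rays through a BEV label grid.

    bev     : 2D integer array of semantic labels (0 = free space, assumed).
    cam_rc  : (row, col) camera position in grid cells.
    heading : viewing direction in radians (0 points along +col).
    Returns one semantic label and one depth (in cells) per image column.
    """
    labels = np.zeros(n_cols, dtype=bev.dtype)
    depth = np.full(n_cols, np.inf)
    angles = heading + np.linspace(-fov / 2, fov / 2, n_cols)
    for i, a in enumerate(angles):
        d = step
        while d <= max_range:
            r = int(round(cam_rc[0] + d * np.sin(a)))
            c = int(round(cam_rc[1] + d * np.cos(a)))
            if not (0 <= r < bev.shape[0] and 0 <= c < bev.shape[1]):
                break  # ray left the map without hitting anything
            if bev[r, c] != 0:
                labels[i], depth[i] = bev[r, c], d  # first occupied cell
                break
            d += step
    return labels, depth

# Hypothetical scene: a 10x10 map with a "wall" of label 2 in column 7.
bev = np.zeros((10, 10), dtype=np.int64)
bev[:, 7] = 2
labels, depth = render_ground_view(bev, cam_rc=(5, 0), heading=0.0, n_cols=3)
```

In the paper's pipeline, rendered semantic, instance, depth and style maps of this kind condition the second-stage diffusion model. The sketch covers only the geometric intuition behind going from a top-down layout to a ground-view perspective.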

Index terms

Deep Learning for Visual Perception; Computer Vision for Transportation; Semantic Scene Understanding