ICRA 2026

LangEditor: Natural Language-Driven 4D Editing for Improved Controllability of Dynamic Driving Scenes

Xiaoyu Liang, Linhui Wang, Chunlam Li, Junhong Lin, Wei Gao

AI summary

LangEditor enables intuitive, photorealistic 4D driving scene editing via natural language prompts, outperforming existing baselines in controllability and visual fidelity.

4D scene editing, autonomous driving, natural language control, Gaussian splatting, diffusion inpainting, scene simulation

Problem

Fully synthetic driving data lacks real-world grounding, while existing editing methods rely on cumbersome manual masks or lack spatiotemporal coherence, hindering the creation of diverse, realistic scenarios for autonomous driving research.

Approach

The framework automatically grounds text instructions to target vehicles, generates physically plausible trajectories, and applies a joint refinement strategy combining dynamic shadow modeling and video diffusion inpainting to ensure spatiotemporal consistency and photorealism.
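The first two stages described above (grounding an instruction to a vehicle, then producing an edited trajectory) can be illustrated with a toy sketch. This is not the paper's implementation: keyword matching stands in for the LLM-based grounding, a smoothstep-eased lateral offset stands in for trajectory generation, and every name here (`Vehicle`, `ground_instruction`, `shift_lane`) is hypothetical. The refinement stage (shadow modeling and diffusion inpainting) is not covered.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Vehicle:
    """Hypothetical scene object; not a structure taken from the paper."""
    vid: int
    color: str
    kind: str
    # Per-frame (x, y) centers on the ground plane: x forward, y left, in metres.
    track: list[tuple[float, float]]


def ground_instruction(instruction: str, vehicles: list[Vehicle]) -> Vehicle:
    """Toy grounding: return the single vehicle whose color and kind both
    appear as words in the instruction (a stand-in for LLM grounding)."""
    words = instruction.lower().split()
    matches = [v for v in vehicles if v.color in words and v.kind in words]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, got {len(matches)}")
    return matches[0]


def smoothstep(t: float) -> float:
    """Cubic ease from 0 to 1 with zero slope at both ends."""
    t = min(max(t, 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def shift_lane(track: list[tuple[float, float]],
               lateral_offset: float) -> list[tuple[float, float]]:
    """Ease the track sideways so the last frame is displaced by
    lateral_offset while the first frame stays where it was."""
    n = len(track)
    if n < 2:
        return [(x, y + lateral_offset) for x, y in track]
    return [
        (x, y + lateral_offset * smoothstep(i / (n - 1)))
        for i, (x, y) in enumerate(track)
    ]


# Usage: two vehicles driving straight, five frames each.
vehicles = [
    Vehicle(0, "white", "truck", [(float(i), 0.0) for i in range(5)]),
    Vehicle(1, "red", "car", [(float(i), 3.5) for i in range(5)]),
]
target = ground_instruction("move the red car one lane to the right", vehicles)
edited = shift_lane(target.track, lateral_offset=-3.5)  # negative y is to the right
```

The eased offset keeps the edited path continuous with the original motion at the first frame, which is one simple way to avoid an abrupt jump; a real system would also need to check collisions and lane geometry.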

Key results

Automatic vehicle selection and trajectory generation via a hierarchical multi-agent LLM workflow
Dynamic Illumination-Aware Shadow Modeling (DIASM) for consistent lighting across time and space
Appearance Refinement module using video diffusion inpainting to eliminate rendering artifacts
State-of-the-art quantitative and qualitative performance in editing quality and controllability over video-based baselines

Why it matters

Bridges the gap between realistic scene editing and user-friendly controllability, offering a scalable tool for data augmentation and simulation in autonomous driving research.

Abstract

Diverse and realistic data are essential for developing reliable autonomous driving (AD) systems, yet collecting and annotating large-scale real-world driving datasets is costly and time-consuming. Recent advances in synthetic scene generation and editing have enabled the creation of diverse driving scenarios. However, fully synthetic scenes often lack real-world grounding, while existing editing approaches are either limited to pure video manipulation or involve cumbersome manual operations. To solve this, we present LangEditor, the natural language-driven 4D editing framework for dynamic driving scenes. LangEditor automatically grounds free-form language instructions to target vehicles and their editable attributes, generating physically plausible trajectories consistent with scene semantics. To ensure spatiotemporal coherence and visual fidelity, we propose a joint refinement strategy that integrates a Dynamic Illumination-Aware Shadow Module for lighting consistency across space-time, and an Appearance Refinement module for synthesizing high-quality textures and material properties. Extensive experiments on realistic driving datasets demonstrate that LangEditor enables intuitive, fine-grained, and photorealistic 4D scene manipulation, outperforming existing baselines in both editing quality and controllability. Our approach bridges the gap between realistic scene editing and user-friendly controllability, offering a powerful tool for data augmentation and simulation in AD research.

Index terms

Computer Vision for Transportation, Visual Learning, Computer Vision for Manufacturing