LangEditor: Natural Language-Driven 4D Editing for Improved Controllability of Dynamic Driving Scenes
Xiaoyu Liang, Linhui Wang, Chunlam Li, Junhong Lin, Wei Gao
AI summary
Problem
Fully synthetic driving data lacks real-world grounding, while existing editing methods rely on cumbersome manual masks or lack spatiotemporal coherence, hindering the creation of diverse, realistic scenarios for autonomous driving research.
Approach
The framework automatically grounds text instructions to target vehicles, generates physically plausible trajectories, and applies a joint refinement strategy combining dynamic shadow modeling and video diffusion inpainting to ensure spatiotemporal consistency and photorealism.
Key results
- Automatic vehicle selection and trajectory generation via a hierarchical multi-agent LLM workflow
- Dynamic Illumination-Aware Shadow Modeling (DIASM) for consistent lighting across time and space
- Appearance Refinement module using video diffusion inpainting to eliminate rendering artifacts
- State-of-the-art quantitative and qualitative performance in editing quality and controllability over video-based baselines
Why it matters
Bridges the gap between realistic scene editing and user-friendly controllability, offering a scalable tool for data augmentation and simulation in autonomous driving research.
Abstract
Diverse and realistic data are essential for developing reliable autonomous driving (AD) systems, yet collecting and annotating large-scale real-world driving datasets is costly and time-consuming. Recent advances in synthetic scene generation and editing have enabled the creation of diverse driving scenarios. However, fully synthetic scenes often lack real-world grounding, while existing editing approaches are either limited to pure video manipulation or involve cumbersome manual operations. To address this, we present LangEditor, a natural language-driven 4D editing framework for dynamic driving scenes. LangEditor automatically grounds free-form language instructions to target vehicles and their editable attributes, generating physically plausible trajectories consistent with scene semantics. To ensure spatiotemporal coherence and visual fidelity, we propose a joint refinement strategy that integrates a Dynamic Illumination-Aware Shadow Module for lighting consistency across space-time, and an Appearance Refinement module for synthesizing high-quality textures and material properties. Extensive experiments on realistic driving datasets demonstrate that LangEditor enables intuitive, fine-grained, and photorealistic 4D scene manipulation, outperforming existing baselines in both editing quality and controllability. Our approach bridges the gap between realistic scene editing and user-friendly controllability, offering a powerful tool for data augmentation and simulation in AD research.
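
Illustrative sketch (not the authors' implementation). The abstract describes two steps before refinement: grounding an instruction to a target vehicle, then generating a trajectory for it. The toy Python below shows only the shape of that data flow. It assumes keyword matching in place of the LLM-based grounding and straight-line interpolation in place of physically plausible trajectory generation. All names (Vehicle, ground_instruction, straight_trajectory) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in metres, in an assumed scene frame


@dataclass
class Vehicle:
    """Hypothetical per-vehicle record; the paper does not specify this layout."""
    vehicle_id: int
    color: str
    kind: str
    position: Point


def ground_instruction(instruction: str, vehicles: List[Vehicle]) -> Optional[Vehicle]:
    """Toy grounding: return the vehicle whose color/kind words best match the text.

    A stand-in for LLM-based grounding. Punctuation is not stripped, so the
    instruction is expected to be plain space-separated words.
    """
    words = set(instruction.lower().split())
    best, best_score = None, 0
    for vehicle in vehicles:
        score = int(vehicle.color in words) + int(vehicle.kind in words)
        if score > best_score:
            best, best_score = vehicle, score
    return best  # None if no attribute matched


def straight_trajectory(start: Point, end: Point, n_frames: int) -> List[Point]:
    """Toy trajectory: one linearly interpolated waypoint per frame."""
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    return [
        (
            start[0] + (end[0] - start[0]) * t / (n_frames - 1),
            start[1] + (end[1] - start[1]) * t / (n_frames - 1),
        )
        for t in range(n_frames)
    ]


if __name__ == "__main__":
    scene = [
        Vehicle(1, "white", "truck", (0.0, 0.0)),
        Vehicle(2, "red", "car", (5.0, 0.0)),
    ]
    target = ground_instruction("Move the red car forward", scene)
    if target is not None:
        # Move the grounded vehicle 10 m along y over 5 frames.
        goal = (target.position[0], target.position[1] + 10.0)
        print(target.vehicle_id, straight_trajectory(target.position, goal, 5))
```

In a real system the interpolation step would have to respect lane geometry, vehicle dynamics, and the other agents in the scene. The sketch leaves that out deliberately.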