UniFuture: A 4D Driving World Model for Future Generation and Perception
Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai
AI summary
Problem
Existing driving world models focus either on 2D video generation without geometric awareness or on static depth perception without temporal dynamics, so they fail to capture the full 4D evolution of driving scenes.
Approach
The authors introduce UniFuture, which maps future RGB images and depth maps into a shared spatio-temporal latent space via Dual-Latent Sharing, and enforces bidirectional consistency between texture and structure using a Multi-scale Latent Interaction mechanism.
Key results
- Unified 4D driving world model framework bridging appearance and geometry
- Dual-Latent Sharing scheme for shared spatio-temporal latent representation
- Multi-scale Latent Interaction mechanism enforcing bidirectional spatio-temporal consistency
- State-of-the-art performance on nuScenes and Waymo, reducing FID by 23.9% versus Vista while outperforming specialized depth estimators
Why it matters
Enables autonomous driving systems to anticipate physically consistent 4D scene evolution, improving spatial reasoning, scenario simulation, and annotated data generation.
Abstract
We present UniFuture, a unified 4D Driving World Model designed to simulate the dynamic evolution of the 3D physical world. Unlike existing driving world models that focus solely on 2D pixel-level video generation (lacking geometry) or static perception (lacking temporal dynamics), our approach bridges appearance and geometry to construct a holistic 4D representation. Specifically, we treat future RGB images and depth maps as coupled projections of the same 4D reality and model them jointly within a single framework. To achieve this, we introduce a Dual-Latent Sharing (DLS) scheme, which maps visual and geometric modalities into a shared spatio-temporal latent space, implicitly entangling texture with structure. Furthermore, we propose a Multi-scale Latent Interaction (MLI) mechanism, which enforces bidirectional consistency: geometry constrains visual synthesis to prevent structural hallucinations, while visual semantics refine geometric estimation. During inference, UniFuture can forecast high-fidelity, geometrically consistent 4D scene sequences (image-depth pairs) from a single current frame. Extensive experiments on the nuScenes and Waymo datasets demonstrate that our method outperforms specialized models in both future generation and geometry perception, highlighting the efficacy of unified 4D modeling for autonomous driving. The code is available at https://github.com/dk-liang/UniFuture.
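The two ideas in the abstract, one latent space shared by both modalities and a two-way exchange of features at several scales, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`shared_encode`, `pyramid`, `interact`), the use of 1D lists in place of image tensors, average pooling as the "encoder", and the additive exchange weighted by `alpha` are all assumptions made for illustration. The actual architecture is in the linked repository.

```python
from typing import List, Tuple

Latent = List[float]


def shared_encode(signal: List[float]) -> Latent:
    """Hypothetical stand-in for a single encoder used by BOTH modalities.

    Here it is just 2x average pooling; the point is only that RGB and
    depth pass through the same function and land in same-shaped latents.
    """
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def pyramid(latent: Latent, num_scales: int) -> List[Latent]:
    """Build a fine-to-coarse list of latents by repeated 2x pooling."""
    levels = [latent]
    for _ in range(num_scales - 1):
        levels.append(shared_encode(levels[-1]))
    return levels


def interact(
    rgb_levels: List[Latent], depth_levels: List[Latent], alpha: float = 0.1
) -> Tuple[List[Latent], List[Latent]]:
    """Assumed bidirectional exchange: at every scale, each branch adds a
    scaled copy of the other branch's features (computed from the inputs,
    so neither direction sees the other's update)."""
    new_rgb, new_depth = [], []
    for r, d in zip(rgb_levels, depth_levels):
        new_rgb.append([ri + alpha * di for ri, di in zip(r, d)])
        new_depth.append([di + alpha * ri for ri, di in zip(r, d)])
    return new_rgb, new_depth


# Made-up 1D "rows" standing in for an RGB frame and its depth map.
rgb_row = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
depth_row = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0]

# Same encoder for both modalities: the latents share one shape.
rgb_latent = shared_encode(rgb_row)
depth_latent = shared_encode(depth_row)

# Two scales per branch, then a two-way exchange at each scale.
rgb_levels = pyramid(rgb_latent, num_scales=2)
depth_levels = pyramid(depth_latent, num_scales=2)
fused_rgb, fused_depth = interact(rgb_levels, depth_levels, alpha=0.5)
```

In this sketch the depth branch influences the appearance branch and vice versa at both resolutions, which mirrors the "geometry constrains visual synthesis, visual semantics refine geometric estimation" description only at the level of data flow. A real model would replace the pooling and additive mixing with learned layers.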