ICRA 2026

UniFuture: A 4D Driving World Model for Future Generation and Perception

Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai

AI summary

UniFuture unifies future video generation and depth estimation in a single framework, outperforming specialized models by jointly learning appearance and geometry.

4D world model, driving simulation, depth estimation, latent sharing, video generation, autonomous driving

Problem

Existing driving world models either focus solely on 2D video generation without geometric awareness or on static depth perception without temporal dynamics, failing to capture the full 4D evolution of driving scenes.

Approach

The authors introduce UniFuture, which maps future RGB images and depth maps into a shared spatio-temporal latent space via Dual-Latent Sharing, and enforces bidirectional consistency between texture and structure using a Multi-scale Latent Interaction mechanism.
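To make the two ideas concrete, here is a minimal toy sketch, not the authors' architecture. It assumes that "shared latent space" means both modalities pass through one and the same encoder, and that "bidirectional interaction" can be illustrated as a symmetric additive exchange at each scale. The function names (`shared_encode`, `pyramid`, `interact`), the pooling encoder, and the `weight` parameter are all hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one H x W map (a frame or a latent)


def shared_encode(frame: Grid) -> Grid:
    """Toy stand-in for a shared encoder: 2x2 average pooling.

    The same function is applied to image and depth inputs, so both
    land in a latent space with identical layout (an assumption here).
    """
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [
        [
            (frame[2 * i][2 * j] + frame[2 * i][2 * j + 1]
             + frame[2 * i + 1][2 * j] + frame[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def pyramid(latent: Grid, num_scales: int) -> List[Grid]:
    """Build a multi-scale pyramid by repeatedly pooling the latent."""
    scales = [latent]
    for _ in range(num_scales - 1):
        scales.append(shared_encode(scales[-1]))
    return scales


def mix(a: Grid, b: Grid, weight: float) -> Grid:
    """Element-wise a + weight * b."""
    return [[x + weight * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def interact(img_scales: List[Grid], depth_scales: List[Grid],
             weight: float = 0.5) -> Tuple[List[Grid], List[Grid]]:
    """Hypothetical bidirectional exchange at every scale.

    Both updates read the pre-update features, so the exchange is symmetric.
    """
    new_img = [mix(i, d, weight) for i, d in zip(img_scales, depth_scales)]
    new_depth = [mix(d, i, weight) for i, d in zip(img_scales, depth_scales)]
    return new_img, new_depth


# Synthetic 8x8 inputs: a grayscale "image" and a "depth map".
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
depth = [[float(r) for _ in range(8)] for r in range(8)]

z_img = pyramid(shared_encode(image), num_scales=2)
z_depth = pyramid(shared_encode(depth), num_scales=2)
z_img, z_depth = interact(z_img, z_depth)
```

In a real model the pooling encoder would be a learned network and the additive exchange a learned module; the sketch only shows the data flow of encoding both modalities identically and letting each stream condition the other at several resolutions.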

Key results

Unified 4D driving world model framework bridging appearance and geometry
Dual-Latent Sharing scheme for shared spatio-temporal latent representation
Multi-scale Latent Interaction mechanism enforcing bidirectional spatio-temporal consistency
State-of-the-art performance on nuScenes and Waymo, reducing FID by 23.9% versus Vista while outperforming specialized depth estimators
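
The 23.9% figure in the last item is a relative reduction (lower FID is better). A minimal sketch of that arithmetic follows; the function name `relative_reduction` is hypothetical, and the two FID values are placeholders chosen for illustration, not numbers reported in the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline * 100.0


# Placeholder values for illustration only; not the paper's reported FIDs.
baseline_fid = 10.0
ours_fid = 7.61
reduction = relative_reduction(baseline_fid, ours_fid)
print(f"{reduction:.1f}% lower FID than the baseline")
```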

Why it matters

Enables autonomous driving systems to anticipate physically consistent 4D scene evolution, improving spatial reasoning, scenario simulation, and annotated data generation.

Abstract

We present UniFuture, a unified 4D Driving World Model designed to simulate the dynamic evolution of the 3D physical world. Unlike existing driving world models that focus solely on 2D pixel-level video generation (lacking geometry) or static perception (lacking temporal dynamics), our approach bridges appearance and geometry to construct a holistic 4D representation. Specifically, we treat future RGB images and depth maps as coupled projections of the same 4D reality and model them jointly within a single framework. To achieve this, we introduce a Dual-Latent Sharing (DLS) scheme, which maps visual and geometric modalities into a shared spatio-temporal latent space, implicitly entangling texture with structure. Furthermore, we propose a Multi-scale Latent Interaction (MLI) mechanism, which enforces bidirectional consistency: geometry constrains visual synthesis to prevent structural hallucinations, while visual semantics refine geometric estimation. During inference, UniFuture can forecast high-fidelity, geometrically consistent 4D scene sequences (image-depth pairs) from a single current frame. Extensive experiments on the nuScenes and Waymo datasets demonstrate that our method outperforms specialized models in both future generation and geometry perception, highlighting the efficacy of unified 4D modeling for autonomous driving. The code is available at https://github.com/dk-liang/UniFuture.

Index terms

Parallel Robots, Force Control, Field Robots