ICRA 2026

Vision-Centric 4D Occupancy Forecasting and Planning Via Implicit Residual World Models

Jianbiao Mei, Yu Yang, Xuemeng Yang, Licheng Wen, Jiajun Lv, Botian Shi, Yong Liu

AI summary

Modeling only residual scene changes instead of full reconstructions significantly boosts both 4D occupancy forecasting and trajectory planning accuracy.
Autonomous driving, World models, 4D occupancy forecasting, Residual prediction, End-to-end planning, BEV representation

Problem

Existing vision-centric world models waste capacity by fully reconstructing static backgrounds for future scenes, which limits dynamic context encoding and causes error accumulation over time.

Approach

IR-WM constructs a current bird’s-eye-view state and predicts only the residual changes conditioned on ego-actions, using a feature alignment module to correct misalignments and prevent error accumulation.
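The residual idea described above can be illustrated with a toy rollout. This is a minimal sketch, not the authors' implementation: the grid type, `rollout`, `add_residual`, and `toy_predictor` are hypothetical names introduced here, a nested list stands in for a BEV feature map, and the feature alignment module is omitted entirely. It only shows that each future state is the previous state plus a predicted change, so cells the predictor leaves at zero carry over untouched.

```python
from typing import Callable, List, Sequence

Grid = List[List[float]]   # toy stand-in for a BEV feature map (H x W, one channel)
Action = Sequence[float]   # hypothetical ego-action encoding


def add_residual(state: Grid, residual: Grid) -> Grid:
    """Return a new grid equal to state + residual, element by element."""
    return [[s + r for s, r in zip(srow, rrow)]
            for srow, rrow in zip(state, residual)]


def rollout(
    bev_now: Grid,
    actions: Sequence[Action],
    predict_residual: Callable[[Grid, Action], Grid],
) -> List[Grid]:
    """Forecast future states by accumulating predicted residuals."""
    states: List[Grid] = []
    state = bev_now
    for action in actions:
        # Only the change is predicted, conditioned on the ego-action.
        residual = predict_residual(state, action)
        # Cells with zero residual are carried over from the previous step.
        state = add_residual(state, residual)
        states.append(state)
    return states


def toy_predictor(state: Grid, action: Action) -> Grid:
    """Hypothetical predictor: add 1.0 to the column named by action[0]."""
    col = int(action[0])
    return [[1.0 if j == col else 0.0 for j in range(len(row))]
            for row in state]


bev = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
futures = rollout(bev, actions=[(0,), (2,)], predict_residual=toy_predictor)
print(futures[-1])  # [[1.0, 0.0, 1.0], [6.0, 5.0, 6.0]]
```

The middle column is never touched by the toy predictor, so its values pass through both steps unchanged. That mirrors the stated motivation: a static background does not need to be reconstructed at every step.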

Key results

  • Achieves state-of-the-art 4D occupancy forecasting on nuScenes
  • Reduces trajectory planning L2 error and collision rates
  • Demonstrates implicit future states substantially improve planning accuracy
  • Shows occupancy-based trajectory filtering adds latency with marginal gains
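For readers unfamiliar with the planning metrics named in the results above, the following sketch shows one common way to compute them. It rests on assumptions: the summary does not state the exact evaluation protocol (horizons, averaging scheme, ego footprint), and `mean_l2_error`, `collides`, and `collision_rate` are hypothetical helpers that treat the ego vehicle as a point on a 2D occupancy grid.

```python
import math
from typing import Sequence, Set, Tuple

Point = Tuple[float, float]   # (x, y) waypoint in metres
Cell = Tuple[int, int]        # occupied grid cell index


def mean_l2_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average Euclidean distance between predicted and ground-truth waypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def collides(traj: Sequence[Point], occupied: Set[Cell], cell_size: float = 1.0) -> bool:
    """True if any waypoint falls inside an occupied cell (point-sized ego, an assumption)."""
    return any(
        (math.floor(x / cell_size), math.floor(y / cell_size)) in occupied
        for x, y in traj
    )


def collision_rate(trajs: Sequence[Sequence[Point]], occupied: Set[Cell],
                   cell_size: float = 1.0) -> float:
    """Fraction of trajectories that hit at least one occupied cell."""
    if not trajs:
        raise ValueError("need at least one trajectory")
    return sum(collides(t, occupied, cell_size) for t in trajs) / len(trajs)


l2 = mean_l2_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
rate = collision_rate(
    [[(0.5, 0.5), (1.5, 1.5)], [(0.5, 0.5), (2.5, 0.5)]],
    occupied={(1, 1)},
)
print(l2, rate)  # 2.5 0.5
```

A check like `collides` could also be used to discard candidate trajectories that cross forecast occupancy, which is the kind of filtering step the last result reports as adding latency for marginal gains.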

Why it matters

Offers a more efficient and reliable paradigm for vision-centric autonomous driving by reallocating model capacity to dynamic changes, benefiting end-to-end planning research and deployment.

Abstract

End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird’s-eye-view (BEV) representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the “residual”, i.e., the changes conditioned on the ego-vehicle’s actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting–planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning. Code is available at https://github.com/yuyang-cloud/Drive-OccWorld

Index terms

Visual Learning; Computer Vision for Transportation; Motion and Path Planning

Related papers