ICRA 2026

OccTENS: 3D Occupancy World Model Via Temporal Next-Scale Prediction

Bu Jin, Songen Gu, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Wei Yin

AI summary

OccTENS enables controllable, high-fidelity long-term 3D occupancy generation with faster inference by decoupling temporal and spatial modeling via a novel temporal next-scale prediction framework.
Occupancy Generation, World Model, Temporal Next-Scale Prediction, Autonomous Driving, Pose Controllability

Problem

Existing autoregressive occupancy world models suffer from computational inefficiency, temporal degradation in long-term generation, and a lack of pose controllability for autonomous driving scenarios.

Approach

The authors reformulate occupancy generation as a temporal next-scale prediction (TENS) task. A TensFormer architecture decouples scene-by-scene temporal prediction from scale-by-scale spatial generation, and a holistic pose aggregation strategy models occupancy and ego-motion in one unified sequence, which supports pose control alongside motion planning.
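
One way to picture the decoupling described above is as an attention mask over tokens ordered first by scene (frame) and then by scale. The sketch below is a minimal illustration, not the authors' TensFormer: the function name `tens_attention_mask`, the `tokens_per_scale` parameter, and the choice to fold both orderings into a single mask are assumptions made for this example. A token may attend to every token of earlier scenes, and within its own scene only to tokens at the same or coarser scales.

```python
def tens_attention_mask(num_frames, tokens_per_scale):
    """Return mask[i][j] == True if query token i may attend to key token j.

    Hypothetical sketch: tokens are laid out frame by frame, and inside each
    frame from the coarsest scale to the finest one.
    """
    # Label every token with its (frame index, scale index).
    labels = [
        (t, k)
        for t in range(num_frames)
        for k, count in enumerate(tokens_per_scale)
        for _ in range(count)
    ]
    mask = []
    for frame_q, scale_q in labels:
        row = [
            # Temporal causality: every token of an earlier frame is visible.
            (frame_k < frame_q)
            # Next-scale order: same frame, same or coarser scale only.
            or (frame_k == frame_q and scale_k <= scale_q)
            for frame_k, scale_k in labels
        ]
        mask.append(row)
    return mask


# Two frames, each with a 1-token coarse scale and a 4-token fine scale.
mask = tens_attention_mask(num_frames=2, tokens_per_scale=[1, 4])
print(len(mask), len(mask[0]))  # 10 10
```

Tokens that share a frame and a scale can see each other, so a whole scale could be produced in one parallel step rather than token by token. Whether the actual model uses a single mask or separate temporal and spatial attention layers is not stated in this summary.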

Key results

  • Higher occupancy quality than the state-of-the-art method
  • Faster inference time than the state-of-the-art method
  • Simultaneous pose controllability and motion planning via unified sequence modeling
  • Effective mitigation of temporal degradation in long-term sequence generation

Why it matters

Provides autonomous driving systems with a computationally efficient and controllable world model for robust long-term scene forecasting and trajectory planning.

Abstract

In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
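
The decomposition named in the abstract can be written as a factorization of the sequence likelihood. The form below is a hedged reading, not an equation taken from the paper: the symbols $x_t$ (occupancy scene at time $t$), $r_t^{k}$ (token map of scene $t$ at scale $k$), $T$ (number of scenes), and $K$ (number of scales) are notation assumed here, and ego-motion tokens are left out for brevity.

```latex
p(x_1, \dots, x_T)
  = \prod_{t=1}^{T} \prod_{k=1}^{K}
    p\!\left( r_t^{k} \,\middle|\, r_t^{1}, \dots, r_t^{k-1},\; x_1, \dots, x_{t-1} \right)
```

The outer product corresponds to temporal scene-by-scene prediction, and the inner product to spatial scale-by-scale generation within one scene.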

Index terms

Intelligent Transportation Systems; Computer Vision for Transportation; Deep Learning for Visual Perception
