OccTENS: 3D Occupancy World Model Via Temporal Next-Scale Prediction
Bu Jin, Songen Gu, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Wei Yin
AI summary
Problem
Existing autoregressive occupancy world models suffer from computational inefficiency, temporal degradation in long-term generation, and a lack of pose controllability for autonomous driving scenarios.
Approach
The authors reformulate occupancy generation as a temporal next-scale prediction (TENS) task. A TensFormer architecture decouples scene-by-scene temporal prediction from scale-by-scale spatial generation, and a holistic pose aggregation strategy models occupancy and ego-motion in one unified sequence to improve pose controllability.
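One way to read this decomposition is as a factorization of the sequence likelihood over frames and, within each frame, over scales. The notation below is an assumption introduced for illustration and does not come from the paper: $x_t$ is the occupancy scene at time $t$, $r_t^{k}$ is its token map at scale $k$ (coarse to fine), $T$ is the number of frames, and $K$ is the number of scales.

```latex
p(x_1, \dots, x_T)
  = \prod_{t=1}^{T} \prod_{k=1}^{K}
    p\!\left(r_t^{k} \,\middle|\, r_t^{1}, \dots, r_t^{k-1},\; x_1, \dots, x_{t-1}\right)
```

Under this reading, the outer product corresponds to scene-by-scene temporal prediction and the inner product to scale-by-scale spatial generation.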
Key results
- Superior occupancy prediction quality over state-of-the-art methods
- Significantly faster inference than autoregressive baselines
- Simultaneous pose controllability and motion planning via unified sequence modeling
- Effective mitigation of temporal degradation in long-term sequence generation
Why it matters
Provides autonomous driving systems with a computationally efficient and controllable world model for robust long-term scene forecasting and trajectory planning.
Abstract
In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
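The interplay of temporal causality and scale-by-scale generation can be illustrated with an attention mask. The following is a minimal sketch under stated assumptions, not the TensFormer design itself: the function name `tens_attention_mask`, the square 2D token grid per scale, and the single joint mask are all hypothetical choices made for illustration. It assumes a token may attend to every token of earlier frames and, within its own frame, to tokens at the same or coarser scales.

```python
import numpy as np


def tens_attention_mask(num_frames, scale_sizes):
    """Build a boolean mask M where M[i, j] is True if query token i
    may attend to key token j.

    Assumed layout (illustrative only): tokens are ordered frame by frame,
    and within each frame scale by scale from coarse to fine. A scale of
    side length s contributes s * s tokens.
    """
    frame_ids = []
    scale_ids = []
    for t in range(num_frames):
        for k, s in enumerate(scale_sizes):
            n = s * s  # number of tokens in an s x s token map
            frame_ids.extend([t] * n)
            scale_ids.extend([k] * n)
    frame_ids = np.array(frame_ids)
    scale_ids = np.array(scale_ids)

    # Rows index queries, columns index keys.
    earlier_frame = frame_ids[None, :] < frame_ids[:, None]
    same_frame = frame_ids[None, :] == frame_ids[:, None]
    coarser_or_same = scale_ids[None, :] <= scale_ids[:, None]

    # Temporal causality across frames, coarse-to-fine order within a frame.
    return earlier_frame | (same_frame & coarser_or_same)


mask = tens_attention_mask(num_frames=2, scale_sizes=[1, 2])
print(mask.astype(int))
```

With two frames and scales of side 1 and 2, each frame holds five tokens, so the mask is 10 x 10. Because all tokens of one scale can see each other, a whole scale can be produced in a single step, which is the usual source of the speed advantage of next-scale prediction over token-by-token autoregression.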