ICRA 2026

OccLLaMA: A Unified Occupancy-Language-Action World Model for Enhancing Motion Planning Via Multi-Task Learning

Julong Wei, Shanshuai Yuan, Pengfei Li, Xinyi Quan, Lei Tai, Jieru Zhao, Zhongxue Gan, Wenchao Ding

AI summary

Unifying 3D semantic occupancy, language, and action tokens in a single autoregressive model significantly enhances both scene understanding and motion planning for autonomous driving.

Occupancy-Language-Action World Model, Autonomous Driving, Multi-Task Learning, Semantic Occupancy, Motion Planning

Problem

Existing autonomous driving models struggle to simultaneously integrate 3D spatial reasoning, language understanding, and action planning, often treating these tasks in isolation or lacking comprehensive world dynamics forecasting.

Approach

OccLLaMA compresses 3D semantic occupancy into discrete tokens via a sparse tokenizer, then aligns them with language and action tokens within a unified autoregressive LLaMA framework to jointly learn scene understanding, forecasting, and planning.
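One way to picture a unified token space is to give each modality its own id range inside a single vocabulary, so that one autoregressive model can predict any of them. The sketch below is illustrative only and is not taken from the paper: the vocabulary sizes, the offsets, the waypoint range, and the helper names `discretize_waypoints` and `build_sequence` are all assumptions.

```python
import numpy as np

# All sizes below are hypothetical, chosen only for illustration.
TEXT_VOCAB = 32000   # language token ids occupy [0, 32000)
OCC_CODES = 2048     # occupancy codebook ids, placed after the text ids
ACTION_BINS = 256    # discretized waypoint bins, placed after occupancy ids

OCC_OFFSET = TEXT_VOCAB
ACT_OFFSET = TEXT_VOCAB + OCC_CODES
VOCAB_SIZE = ACT_OFFSET + ACTION_BINS


def discretize_waypoints(xy, lo=-50.0, hi=50.0, bins=ACTION_BINS):
    """Map continuous waypoint coordinates (meters, assumed range) to integer bins."""
    xy = np.clip(np.asarray(xy, dtype=np.float64), lo, hi)
    idx = np.floor((xy - lo) / (hi - lo) * bins).astype(np.int64)
    # The upper bound `hi` would land in bin `bins`; fold it into the last bin.
    return np.minimum(idx, bins - 1).ravel()


def build_sequence(occ_codes, text_ids, waypoints):
    """Concatenate scene, language, and action tokens in one shared id space."""
    occ = np.asarray(occ_codes, dtype=np.int64) + OCC_OFFSET
    txt = np.asarray(text_ids, dtype=np.int64)
    act = discretize_waypoints(waypoints) + ACT_OFFSET
    return np.concatenate([occ, txt, act])


seq = build_sequence(
    occ_codes=[5, 17, 2047],          # made-up scene codes
    text_ids=[1, 450, 2],             # made-up text token ids
    waypoints=[[0.0, 1.5], [0.2, 3.0]],
)
print(seq.shape, int(seq.max()) < VOCAB_SIZE)
```

Because the three id ranges never overlap, a single embedding table and output head of size `VOCAB_SIZE` can serve all modalities, and the range a token falls in identifies its modality at decoding time.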

Key results

Achieves state-of-the-art spatial reasoning on NuScenesQA
Delivers competitive performance across understanding, forecasting, and planning tasks
Shows that multi-task learning and chain-of-thought reasoning improve motion planning
Introduces a decoupled tokenizer that resolves 3D occupancy sparsity
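To illustrate why sparsity matters for occupancy grids, the sketch below stores only the non-empty voxels of a grid and then restores the dense grid exactly. It is not the paper's tokenizer, only a generic sparse encoding; the grid size, the `EMPTY` label value, and the function names `to_sparse` and `to_dense` are assumptions made for the example.

```python
import numpy as np

EMPTY = 0  # hypothetical label for free space


def to_sparse(grid):
    """Return coordinates and semantic labels of the non-empty voxels only."""
    coords = np.argwhere(grid != EMPTY)   # shape (N, 3)
    labels = grid[tuple(coords.T)]        # shape (N,)
    return coords, labels


def to_dense(coords, labels, shape):
    """Rebuild the dense grid; every voxel not listed is treated as empty."""
    grid = np.full(shape, EMPTY, dtype=labels.dtype)
    grid[tuple(coords.T)] = labels
    return grid


# A mostly empty grid with three occupied voxels (sizes and labels are made up).
grid = np.zeros((200, 200, 16), dtype=np.uint8)
grid[100, 100, 2] = 4
grid[101, 100, 2] = 4
grid[50, 60, 0] = 11

coords, labels = to_sparse(grid)
restored = to_dense(coords, labels, grid.shape)
print(len(labels), "occupied voxels out of", grid.size)
```

In this toy grid only 3 of 640,000 voxels carry information, so a representation that skips empty space avoids spending model capacity on free space while remaining exactly invertible.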

Why it matters

Offers a scalable foundation model for autonomous driving that bridges 3D spatial reasoning, language, and planning, enabling more robust and explainable real-world navigation.

Abstract

Scene understanding via multi-modal large language models and scene forecasting with world models have advanced the development of autonomous driving. The former maps visual inputs to driving-specific outputs, neglecting spatial reasoning and world dynamics. The latter captures world dynamics, lacking comprehensive scene understanding. In contrast, human drivers seamlessly integrate understanding, forecasting, and decision-making through multi-modal representations. To this end, we propose OccLLaMA, a unified occupancy-language-action world model to enhance motion planning via multi-task learning. It uses semantic occupancy as a unified 3D visual representation, effectively integrating spatial scene understanding and forecasting. Specifically, we first introduce a tailored scene tokenizer that auto-encodes semantic occupancy into latent tokens for invertible compression. Furthermore, we enhance LLaMA to enable joint learning across both understanding and generation tasks within a unified auto-regressive framework, incorporating multi-task pretraining and motion-planning–oriented fine-tuning. Extensive experiments demonstrate that OccLLaMA not only achieves competitive performance on scene understanding and occupancy forecasting, but also enhances motion planning by integrating multi-task inference, showcasing its effectiveness and potential as a foundation model for autonomous driving. Project page: OccLLaMA

Index terms

Computer Vision for Automation, Deep Learning for Visual Perception, Autonomous Vehicle Navigation