ICRA 2026

Enhancing Indoor Occupancy Prediction Via Sparse Query-Based Multi-Level Consistent Knowledge Distillation

Xiang Li, Yupeng Zheng∗, Pengfei Li, Yilun Chen, Ya-Qin Zhang, Wenchao Ding

AI summary

DiScene achieves state-of-the-art indoor occupancy prediction accuracy and real-time speed by leveraging multi-level knowledge distillation on sparse queries.

3D Occupancy Prediction, Knowledge Distillation, Sparse Queries, Real-Time Perception, Indoor Scene Understanding, Model Compression

Problem

Indoor occupancy prediction methods face a critical accuracy-efficiency trade-off: dense approaches waste computation on empty space, while sparse query-based methods lack robustness and struggle with effective knowledge transfer during distillation.

Approach

The authors introduce DiScene, a sparse query-based framework that aligns teacher and student models across four levels (encoder features, queries, spatial priors, and high-confidence anchors) and uses teacher-guided initialization to accelerate training without extra inference costs.
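To make the four alignment levels concrete, here is a minimal NumPy sketch of how such terms might be combined into one training loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, dictionary keys, equal loss weights, the 0.9 confidence threshold, and the choice of mean squared error and cross-entropy are all hypothetical. It also assumes teacher and student tensors already share shapes; a lightweight student would normally need projection layers first.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((a - b) ** 2))


def multi_level_distillation_loss(student, teacher, weights=None, conf_threshold=0.9):
    """Weighted sum of four alignment terms (all names are hypothetical).

    Both inputs are dicts holding:
      "encoder": image encoder features, e.g. shape (C, H, W)
      "queries": sparse query features, shape (N, D)
      "prior":   predicted 3D point positions, shape (N, 3)
      "logits":  per-query class logits, shape (N, K)
    """
    if weights is None:
        # Equal weighting is an assumption made for this sketch.
        weights = {"encoder": 1.0, "query": 1.0, "prior": 1.0, "anchor": 1.0}

    terms = {
        "encoder": mse(student["encoder"], teacher["encoder"]),  # encoder level
        "query": mse(student["queries"], teacher["queries"]),    # query level
        "prior": mse(student["prior"], teacher["prior"]),        # prior level
    }

    # Anchor level: distill only where the teacher is highly confident.
    t_prob = softmax(teacher["logits"])
    s_prob = softmax(student["logits"])
    anchors = t_prob.max(axis=-1) >= conf_threshold
    if anchors.any():
        log_s = np.log(np.clip(s_prob[anchors], 1e-12, 1.0))
        cross_entropy = -np.sum(t_prob[anchors] * log_s, axis=-1)
        terms["anchor"] = float(cross_entropy.mean())
    else:
        terms["anchor"] = 0.0

    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

Because the teacher is only consulted when this loss is computed during training, a scheme of this shape adds nothing to the student's inference cost, which matches the claim above.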

Key results

Surpasses baseline OPUS by 36.1% mIoU at 23.2 FPS without depth priors
Matches depth-enhanced OPUS† performance while maintaining real-time speed
Achieves new SOTA on Occ-ScanNet with depth integration, outperforming EmbodiedOcc by 3.7% with 1.62× faster inference
Demonstrates robust cross-domain generalization on Occ3D-nuScenes and in-the-wild datasets

Why it matters

Provides a computationally efficient pathway for real-time 3D scene understanding, enabling practical deployment of indoor robots and autonomous systems without heavy depth model dependencies.

Abstract

Occupancy prediction provides critical geometric and semantic understanding for robotics but faces efficiency-accuracy trade-offs. Current dense methods waste computation on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this paper, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels, including encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer, and (2) a Teacher-Guided Initialization policy, employing optimized parameter warm-up to accelerate model convergence. Validated on the Occ-ScanNet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even surpassing the depth-enhanced version, OPUS†. With depth integration, DiScene† attains new SOTA performance, surpassing EmbodiedOcc by 3.7% with 1.62× faster inference speed. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments. Code and models can be accessed at https://github.com/getterupper/DiScene.
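The abstract describes Teacher-Guided Initialization only as an optimized parameter warm-up and gives no procedure. One generic way to warm-start a student is to copy teacher parameters wherever the name and shape match and to leave the rest at their own initial values. The sketch below illustrates that generic idea only; the function name, the parameter names, and the matching rule are assumptions and may differ from what DiScene actually does.

```python
import numpy as np


def teacher_guided_init(student_params, teacher_params):
    """Warm-start a student from a teacher (hypothetical matching rule).

    Both arguments map parameter names to NumPy arrays. A teacher array is
    copied into the student only when the name exists in both models and the
    shapes are identical; otherwise the student's own value is kept.
    Returns the new parameter dict and the list of names that were copied.
    """
    initialized = {}
    copied = []
    for name, value in student_params.items():
        teacher_value = teacher_params.get(name)
        if teacher_value is not None and teacher_value.shape == value.shape:
            # Copy so later student updates never modify the teacher.
            initialized[name] = teacher_value.copy()
            copied.append(name)
        else:
            initialized[name] = value.copy()
    return initialized, copied
```

Since the copy happens once before training starts, this kind of warm-up changes only the starting point of optimization and leaves the student's architecture and inference cost untouched.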

Index terms

Semantic Scene Understanding, Vision-Based Navigation