ICRA 2026

OmniMap: A General Mapping Framework Integrating Optics, Geometry, and Semantics

Yinan Deng, Yufeng Yue, Jianyu Dou, Jingyu Zhao, Jiahui Wang, Yujie Tang, Yi Yang, Mengyin Fu

AI summary

OmniMap is the first online mapping framework that simultaneously captures high-fidelity optical rendering, precise geometric reconstruction, and open-vocabulary semantic understanding in real time.

Open-vocabulary mapping, 3D Gaussian Splatting, Online reconstruction, Semantic understanding, Robotics, Multi-modal perception

Problem

Existing robotic mapping methods typically capture only a subset of scene attributes. They suffer from optical blurring, geometric irregularities, or semantic ambiguities, or they fall short of real-time performance.

Approach

OmniMap employs a tightly coupled hybrid 3D Gaussian Splatting and voxel representation, integrating a differentiable camera model for motion blur and exposure compensation, normal-constrained geometry updates, and probabilistic fusion for robust open-vocabulary instance understanding.
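To give a feel for what "probabilistic fusion" can mean for instance labels, here is a minimal sketch of a recursive Bayesian update over a categorical belief. It is an illustration only and is not taken from the paper: the fixed candidate label set, the per-frame probability vectors, and the names `fuse_log_probs` and `fuse_sequence` are all assumptions, and OmniMap's actual formulation may differ (for example, it may fuse open-vocabulary embeddings rather than class probabilities).

```python
import math


def fuse_log_probs(log_prior, frame_probs, eps=1e-6):
    """One recursive Bayesian update of a categorical label belief.

    Hypothetical helper: posterior is proportional to prior * per-frame likelihood,
    computed in the log domain for numerical stability.
    """
    # Clamp probabilities so one overconfident frame cannot zero out a label.
    log_post = [lp + math.log(max(p, eps)) for lp, p in zip(log_prior, frame_probs)]
    # Normalize with log-sum-exp so the belief stays a valid distribution.
    peak = max(log_post)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in log_post))
    return [v - log_z for v in log_post]


def fuse_sequence(frames, num_labels):
    """Fuse per-frame label probabilities for one map instance."""
    # Start from a uniform prior over the (assumed) candidate labels.
    belief = [math.log(1.0 / num_labels)] * num_labels
    for probs in frames:
        belief = fuse_log_probs(belief, probs)
    return [math.exp(v) for v in belief]


# Made-up observations of one instance; the second frame disagrees with the rest.
labels = ["chair", "sofa", "table"]
frames = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]
posterior = fuse_sequence(frames, len(labels))
best = labels[max(range(len(labels)), key=posterior.__getitem__)]
print(best, [round(p, 3) for p in posterior])
```

In this toy run the single disagreeing frame lowers confidence in the majority label but does not overturn it, which is the kind of robustness that accumulating evidence across views is meant to provide.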

Key results

State-of-the-art rendering fidelity, mesh quality, and zero-shot semantic segmentation
Real-time online mapping at 5.55 fps with a compact model size
Support for versatile downstream tasks including scene Q&A, interactive editing, and map-assisted navigation
Novel hybrid 3DGS-Voxel representation ensuring structural stability and fine-grained detail
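
For scale, the reported 5.55 fps can be converted into an average per-frame time budget. This is plain arithmetic on the figure above, not an additional measurement from the paper:

```latex
t_{\text{frame}} = \frac{1}{5.55\ \text{frames/s}} \approx 0.180\ \text{s} \approx 180\ \text{ms per frame}
```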

Why it matters

Provides robotic systems and embodied AI agents with a unified, real-time 3D environmental representation essential for complex perception, manipulation, and navigation tasks.

Abstract

Robotic systems demand accurate and comprehensive 3D environment perception, requiring simultaneous capture of photo-realistic appearance (optical), precise layout shape (geometric), and open-vocabulary scene understanding (semantic). Existing methods typically achieve only partial fulfillment of these requirements while exhibiting optical blurring, geometric irregularities, and semantic ambiguities. To address these challenges, we propose OmniMap. Overall, OmniMap represents the first online mapping framework that simultaneously captures optical, geometric, and semantic scene attributes while maintaining real-time performance and model compactness. At the architectural level, OmniMap employs a tightly coupled 3DGS–Voxel hybrid representation that combines fine-grained modeling with structural stability. At the implementation level, OmniMap identifies key challenges across different modalities and introduces several innovations: adaptive camera modeling for motion blur and exposure compensation, hybrid incremental representation with normal constraints, and probabilistic fusion for robust instance-level understanding. Extensive experiments show OmniMap’s superior performance in rendering fidelity, geometric accuracy, and zero-shot semantic segmentation compared to state-of-the-art methods across diverse scenes. The framework’s versatility is further evidenced through a variety of downstream applications, including multi-domain scene Q&A, interactive editing, perception-guided manipulation, and map-assisted navigation.

Manuscript received: 15 May 2025; Accepted 24 August 2025. This article was recommended for publication by Editor Javier Civera upon evaluation of the reviewers’ comments. This work is supported by the National Natural Science Foundation of China under Grant 92370203, 62473050, 62233002, Beijing Natural Science Foundation Undergraduate Research Program QY24180. (Corresponding Author: Yufeng Yue)

Yinan Deng, Yufeng Yue, Jianyu Dou, Jingyu Zhao, Jiahui Wang, Yujie Tang, and Yi Yang are with School of Automation, Beijing Institute of Technology, Beijing 100081, China (e-mail: dengyinan@bit.edu.cn; yueyufeng@bit.edu.cn; BruceDou030806@163.com; unique zhao0210@163.com; wjh@bit.edu.cn; 3120235697@bit.edu.cn; yang yi@bit.edu.cn). Mengyin Fu is with the School of Automation, Beijing Institute of Technology, Beijing 100081, China, and the School of Automation, Nanjing University of Science and Technology, Nanjing 210018, China (e-mail: fumy@bit.edu.cn).

The project page of OmniMap is available at https://omni-map.github.io/.

Index terms

Mapping, Semantic Scene Understanding, RGB-D Perception, Perception for Grasping and Manipulation