RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation
Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami
AI summary
Problem
Existing 3-D semantic mapping systems lack open-vocabulary flexibility during online operation and struggle to maintain temporal consistency with streaming data, while relying heavily on offline processing or extensive task-specific training.
Approach
The method integrates GPU-accelerated geometric reconstruction with pretrained vision-language models through online instance-level semantic embedding fusion, using efficient spatial indexing and hierarchical association to robustly track objects and update a unified 3-D map in real time.
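As a rough illustration of what online instance-level semantic embedding fusion could look like, the sketch below keeps one unit-length embedding per tracked instance and folds in each new per-frame VLM embedding as a weighted running average. The function names (`normalize`, `fuse_embedding`, `cosine`), the weighting scheme, and the re-normalization step are assumptions made for illustration; the summary does not specify the fusion rule actually used.

```python
import math


def normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def fuse_embedding(fused, weight, observation, obs_weight=1.0):
    """Fold one per-frame embedding into an instance's running embedding.

    fused       -- current fused embedding (all zeros before the first update)
    weight      -- accumulated weight of past observations (0.0 initially)
    observation -- new embedding for this instance from the current frame
    obs_weight  -- hypothetical per-observation weight, e.g. a mask confidence
    Returns the updated unit-length embedding and the new accumulated weight.
    """
    obs = normalize(observation)
    total = weight + obs_weight
    mixed = [(weight * f + obs_weight * o) / total for f, o in zip(fused, obs)]
    return normalize(mixed), total


def cosine(a, b):
    """Cosine similarity, used here to score an instance against a text query."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Two toy 2-D observations of the same instance.
fused, w = fuse_embedding([0.0, 0.0], 0.0, [3.0, 4.0])
fused, w = fuse_embedding(fused, w, [1.0, 0.0])
```

Because the fused vector stays unit length, an open-vocabulary query reduces to ranking instances by `cosine(fused, text_embedding)`, where `text_embedding` would come from the same vision-language model.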
Key results
- Modular zero-shot 3-D semantic mapping framework
- Online instance-level geometric and semantic fusion algorithm
- R-tree spatial indexing with bipartite matching for fast 3-D tracking
- Unified geometric-semantic update mechanism for temporal consistency
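The association step named in the list above can be sketched in a few lines. This is a minimal, assumption-laden example: it scores detections against existing tracks by the IoU of axis-aligned 3-D boxes and picks the one-to-one assignment with the highest total IoU. For brevity it compares every pair instead of querying an R-tree for overlapping candidates, and it finds the optimal matching by brute force over permutations instead of a Hungarian-style solver, so it is only suitable for a handful of objects. The names `iou_3d`, `associate`, and `min_iou` are hypothetical.

```python
from itertools import permutations


def iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def associate(detections, tracks, min_iou=0.1):
    """Return sorted (detection_index, track_index) pairs maximizing total IoU.

    Pairs whose IoU falls below min_iou are treated as non-matches, so the
    corresponding detections would start new instances in a full system.
    """
    if not detections or not tracks:
        return []
    scores = [[iou_3d(d, t) for t in tracks] for d in detections]
    n_d, n_t = len(detections), len(tracks)
    best_total, best_pairs = -1.0, []
    if n_d <= n_t:
        # Assign each detection a distinct track.
        candidates = ([(i, j) for i, j in enumerate(p)]
                      for p in permutations(range(n_t), n_d))
    else:
        # Assign each track a distinct detection.
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(n_d), n_t))
    for pairs in candidates:
        kept = [(i, j) for i, j in pairs if scores[i][j] >= min_iou]
        total = sum(scores[i][j] for i, j in kept)
        if total > best_total:
            best_total, best_pairs = total, kept
    return sorted(best_pairs)


tracks = [(0, 0, 0, 1, 1, 1), (5, 5, 5, 6, 6, 6)]
detections = [(5.1, 5, 5, 6.1, 6, 6), (0, 0.1, 0, 1, 1.1, 1), (20, 20, 20, 21, 21, 21)]
matches = associate(detections, tracks)
```

Here the third detection overlaps no track and is left unmatched. In a larger scene, a spatial index would restrict scoring to boxes that actually intersect, and a polynomial-time assignment solver would replace the permutation search.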
Why it matters
Enables autonomous robots and AR systems to perceive and reason about arbitrary objects in dynamic environments in real time without task-specific training.
Abstract
Mapping and understanding complex 3-D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3-D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build semantic maps with open-vocabulary during online operation. Although recent vision-language models (VLMs) have enabled open-vocabulary object recognition in 2-D images, they haven’t yet bridged the gap to 3-D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3-D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this article, we develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary VLMs through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates, while robustly handling 2-D segmentation inconsistencies. The proposed general-purpose 3-D scene understanding framework can be used for various tasks including zero-shot 3-D instance retrieval, segmentation, and object detection to reason about previously unseen objects and interpret natural language queries.