ICRA 2026

RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation

Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami

AI summary

A training-free framework that fuses GPU-accelerated geometric reconstruction with open-vocabulary vision-language models to enable real-time, consistent 3-D semantic mapping and natural language queries.

Open-vocabulary 3-D mapping, Zero-shot semantic reconstruction, Vision-language models, Real-time SLAM, Spatio-temporal aggregation, Robotic scene understanding

Problem

Existing 3-D semantic mapping systems lack open-vocabulary flexibility during online operation, struggle to maintain temporal consistency on streaming data, and rely heavily on offline processing or extensive task-specific training.

Approach

The method integrates GPU-accelerated geometric reconstruction with pretrained vision-language models through online instance-level semantic embedding fusion, using efficient spatial indexing and hierarchical association to robustly track objects and update a unified 3-D map in real time.
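
One way to picture online instance-level embedding fusion is a running, confidence-weighted mean of per-frame embeddings kept for each 3-D instance. The sketch below is only an illustration under that assumption; the class name `InstanceEmbedding`, the `weight` parameter, and the averaging rule itself are hypothetical and are not taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class InstanceEmbedding:
    """Running weighted mean of per-frame embeddings for one 3-D instance.

    Hypothetical helper: the fusion rule is an assumption, not the paper's.
    """

    def __init__(self, dim):
        self.weighted_sum = [0.0] * dim
        self.total_weight = 0.0

    def update(self, embedding, weight=1.0):
        # Normalize first so each frame contributes a direction, not a magnitude.
        unit = normalize(embedding)
        self.weighted_sum = [s + weight * u for s, u in zip(self.weighted_sum, unit)]
        self.total_weight += weight

    def fused(self):
        # Re-normalize so the result can be compared by cosine similarity.
        return normalize(self.weighted_sum)


# Two observations of the same instance; the first is trusted three times as much.
instance = InstanceEmbedding(dim=3)
instance.update([2.0, 0.0, 0.0], weight=3.0)
instance.update([0.0, 5.0, 0.0], weight=1.0)
print(instance.fused())
```

Because only the weighted sum and the total weight are stored, each new frame costs one pass over the embedding and no past frames need to be kept, which fits incremental processing.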

Key results

Modular zero-shot 3-D semantic mapping framework
Online instance-level geometric and semantic fusion algorithm
R-tree spatial indexing with bipartite matching for fast 3-D tracking
Unified geometric-semantic update mechanism for temporal consistency
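
The spatial indexing and bipartite matching items above can be sketched as a two-step association between new detections and existing tracks: filter candidate pairs by box overlap, then pick the one-to-one assignment with the highest total overlap. This is a minimal sketch under stated assumptions, not the authors' implementation. The brute-force overlap test stands in for an R-tree query, the exhaustive bitmask search stands in for a proper assignment solver and only suits small sets, and `iou_3d`, `associate`, and `min_iou` are hypothetical names.

```python
from functools import lru_cache


def volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3-D boxes."""
    inter = 1.0
    for k in range(3):
        overlap = min(a[k + 3], b[k + 3]) - max(a[k], b[k])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    return inter / (volume(a) + volume(b) - inter)


def associate(detections, tracks, min_iou=0.1):
    """Return {detection_index: track_index} maximizing total IoU.

    Detections with no sufficiently overlapping track are left out.
    """
    scores = [[iou_3d(d, t) for t in tracks] for d in detections]
    # Candidate filter: a spatial index such as an R-tree would return these
    # overlapping pairs without scoring every detection against every track.
    candidates = [[j for j, s in enumerate(row) if s >= min_iou] for row in scores]

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best (score, pairs) for detections i.. given a bitmask of taken tracks.
        if i == len(detections):
            return 0.0, ()
        result = best(i + 1, used)  # option: leave detection i unmatched
        for j in candidates[i]:
            if not used & (1 << j):
                score, pairs = best(i + 1, used | (1 << j))
                score += scores[i][j]
                if score > result[0]:
                    result = (score, ((i, j),) + pairs)
        return result

    return dict(best(0, 0)[1])


tracks = [(0, 0, 0, 1, 1, 1), (2, 0, 0, 3, 1, 1)]
detections = [(0.1, 0, 0, 1.1, 1, 1), (2.2, 0, 0, 3.2, 1, 1), (10, 10, 10, 11, 11, 11)]
print(associate(detections, tracks))  # the third detection overlaps no track
```

A detection left unmatched here would typically start a new track, while matched pairs would have their geometry and embedding updated together.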

Why it matters

Enables autonomous robots and AR systems to perceive and reason about arbitrary objects in dynamic environments in real time without task-specific training.

Abstract

Mapping and understanding complex 3-D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3-D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build semantic maps with open-vocabulary during online operation. Although recent vision-language models (VLMs) have enabled open-vocabulary object recognition in 2-D images, they haven’t yet bridged the gap to 3-D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3-D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this article, we develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary VLMs through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates, while robustly handling 2-D segmentation inconsistencies. The proposed general-purpose 3-D scene understanding framework can be used for various tasks including zero-shot 3-D instance retrieval, segmentation, and object detection to reason about previously unseen objects and interpret natural language queries.
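
Zero-shot instance retrieval from a natural language query can be illustrated as ranking the fused instance embeddings by cosine similarity to an embedding of the query text. The sketch below assumes both embeddings live in a shared space produced by a pretrained VLM. The vectors are made-up toy values, and `retrieve` and `cosine_similarity` are hypothetical helpers rather than the paper's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve(query_embedding, instance_embeddings, top_k=1):
    """Return the ids of the top_k instances most similar to the query."""
    ranked = sorted(
        instance_embeddings,
        key=lambda inst_id: cosine_similarity(query_embedding, instance_embeddings[inst_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy 3-D embeddings; real ones would come from a VLM's image and text encoders.
instances = {"chair": [1.0, 0.0, 0.0], "lamp": [0.0, 1.0, 0.0]}
query = [0.9, 0.1, 0.0]  # stands in for the embedded text "something to sit on"
print(retrieve(query, instances))
```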

Index terms

Semantic Scene Understanding, RGB-D Perception, Recognition, SLAM