ICRA 2026

SurveilNav: Collaborative Object Goal Navigation with Robot and Surveillance System

Ming-Ming Yu, Qunbo Wang, Rongtao Xu, Yanghong Mei, YiRong Yang, Longteng Guo, Wenjun Wu, Jing Liu

AI summary

Pairing mobile robots with building surveillance systems significantly boosts exploration efficiency and navigation success in large-scale indoor environments.

object-goal navigation, collaborative perception, robot-surveillance systems, vision-language models, indoor navigation, Habitat-Sim

Problem

Single-robot object navigation struggles with limited perception and blind spots in large-scale, multi-floor settings, while existing collaborative perception research remains focused on autonomous driving rather than indoor robotics.

Approach

SurveilNav dynamically selects relevant surveillance cameras, fuses their views with the robot’s local RGB-D data to build joint maps, and uses a vision-language model to estimate semantic relevance and verify targets for efficient navigation.
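
As a rough illustration of the camera selection and view fusion described above, the sketch below back-projects depth images from a robot and a fixed camera into one shared top-down grid. It is not the paper's implementation. The function names, the distance-based selection rule, the pinhole intrinsics (fx, fy, cx, cy), and the z-up world frame are all assumptions made for this example.

```python
import numpy as np

def select_cameras(robot_xy, camera_xy, radius):
    """Hypothetical scheduling rule: indices of cameras within `radius` of the robot."""
    dist = np.linalg.norm(camera_xy - robot_xy, axis=1)
    return np.flatnonzero(dist <= radius)

def backproject(depth, fx, fy, cx, cy):
    """Turn a depth image (metres) into (N, 3) camera-frame points.

    Assumes a pinhole camera with x right, y down, z forward.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (us[valid] - cx) * z / fx
    y = (vs[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def to_world(points_cam, cam_to_world):
    """Apply a 4x4 camera-to-world transform to (N, 3) points."""
    homo = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (homo @ cam_to_world.T)[:, :3]

def fuse_into_grid(grid, points_world, origin_xy, cell_size):
    """Mark every grid cell hit by a world point (top-down projection on x, y)."""
    ij = np.floor((points_world[:, :2] - origin_xy) / cell_size).astype(int)
    inside = (
        (ij[:, 0] >= 0) & (ij[:, 0] < grid.shape[0])
        & (ij[:, 1] >= 0) & (ij[:, 1] < grid.shape[1])
    )
    grid[ij[inside, 0], ij[inside, 1]] = True
    return grid
```

Calling `fuse_into_grid` once for the robot's points and once for each selected camera's points, on the same grid, produces a joint map. A real system would also need calibrated camera poses and a way to handle free space and occlusion, which this sketch leaves out.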

Key results

First indoor robot-surveillance collaboration dataset spanning 36 scenes, 74 floors, and 206 cameras
Novel framework enabling active camera invocation, joint 2D/3D mapping, and VLM-based target verification
State-of-the-art exploration efficiency and navigation success rates on the HM3D benchmark
Robust performance across varying camera densities and multi-floor configurations

Why it matters

Provides a scalable blueprint for leveraging existing building infrastructure to enhance indoor search, rescue, and household robotics tasks.

Abstract

With the growing deployment of surveillance systems in factories, offices, and homes, integrating them with robots offers a promising direction for collaborative and efficient task execution. However, existing approaches largely focus on single-robot scenarios and struggle with multi-view collaboration in large-scale environments. In this paper, we present a novel indoor collaborative object navigation dataset built on Habitat-Sim, featuring 206 cameras across 74 floors. The dataset enables systematic evaluation of an agent’s ability to exploit multi-view surveillance information. To address the limitations of single-robot perception, we propose SurveilNav, a collaborative navigation framework that integrates active camera scheduling, joint 2D/3D mapping, VLM-based value estimation, and collaborative target verification. By synergizing the robot’s dynamic local perception with the static global view of surveillance, this architecture effectively overcomes both the limited perception range of single agents and the inherent blind spots of fixed cameras, resolving inefficient exploration. Experimental results on the HM3D dataset demonstrate that SurveilNav substantially outperforms existing methods, achieving state-of-the-art performance in both exploration efficiency and navigation success rate. Moreover, the system shows strong potential for applications in large-scale search, home environments, and rescue missions.

Index terms

Surveillance Robotic Systems, Cognitive Modeling