ICRA 2026

Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models

Bangguo Yu, Qihao Yuan, Kailai Li, Hamidreza Kasaei, Ming Cao

AI summary

A vision language model acting as a global planner enables efficient, zero-shot multi-robot frontier assignment, significantly boosting navigation success and speed in complex environments.
Multi-robot navigation, Vision language models, Visual semantic navigation, Frontier assignment, Zero-shot planning, Cooperative exploration

Problem

Existing visual target navigation methods are typically single-robot, lack common-sense reasoning, and suffer from poor efficiency and robustness in complex, unknown environments.

Approach

Co-NavGPT merges local maps from multiple robots into a unified global representation and uses a vision language model to assign unexplored frontier regions to each robot based on spatial and semantic context.
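The pipeline described above (merge local maps, find frontiers, assign them to robots) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the grids are assumed to be already aligned and equal in size, the cell codes and the function names merge_maps, find_frontiers, and assign_greedy are hypothetical, and a nearest-frontier greedy rule stands in for the vision language model, which in Co-NavGPT makes the assignment using spatial and semantic context.

```python
from math import hypot

# Hypothetical cell codes, ordered so that max() prefers more informative cells.
UNKNOWN, FREE, OCCUPIED = -1, 0, 1


def merge_maps(local_maps):
    """Merge aligned, same-sized grids: OCCUPIED beats FREE, FREE beats UNKNOWN."""
    rows, cols = len(local_maps[0]), len(local_maps[0][0])
    merged = [[UNKNOWN] * cols for _ in range(rows)]
    for grid in local_maps:
        for r in range(rows):
            for c in range(cols):
                merged[r][c] = max(merged[r][c], grid[r][c])
    return merged


def find_frontiers(grid):
    """Return FREE cells that touch at least one UNKNOWN 4-neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if any(
                0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN
                for nr, nc in neighbours
            ):
                frontiers.append((r, c))
    return frontiers


def assign_greedy(robots, frontiers):
    """Stand-in for the VLM planner: each robot takes the nearest unclaimed frontier."""
    remaining = list(frontiers)
    assignment = {}
    for name, (rr, rc) in robots.items():
        if not remaining:
            break
        best = min(remaining, key=lambda f: hypot(f[0] - rr, f[1] - rc))
        assignment[name] = best
        remaining.remove(best)
    return assignment


# Two robots that have each observed a different part of the same 3x4 area.
map_a = [[0, 0, -1, -1], [0, 1, -1, -1], [0, 0, -1, -1]]
map_b = [[-1, -1, 0, -1], [-1, -1, 0, -1], [-1, -1, 0, -1]]

global_map = merge_maps([map_a, map_b])
frontiers = find_frontiers(global_map)
plan = assign_greedy({"robot_0": (0, 0), "robot_1": (2, 0)}, frontiers)
print(plan)
```

In this sketch the greedy rule uses only distance. The point of the VLM planner in the paper is to replace that step with a decision informed by what the merged map shows, so assign_greedy marks where such a planner would plug in rather than how it works.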

Key results

  • Outperforms existing baselines in success rate and navigation efficiency on HM3D simulation
  • Achieves real-time planning (~5 FPS) in real-world quadruped robot deployments
  • Enables zero-shot cooperative navigation without task-specific training
  • Ablation studies confirm VLM semantic priors significantly enhance collaborative search

Why it matters

Provides a scalable, training-free framework for coordinated multi-robot exploration, advancing practical applications in search, logistics, and human-robot interaction.

Abstract

Visual target navigation is a critical capability for autonomous robots operating in unknown environments, particularly in human-robot interaction scenarios. While classical and learning-based methods have shown promise, most existing approaches lack common-sense reasoning and are typically designed for single-robot settings, leading to reduced efficiency and robustness in complex environments. To address these limitations, we introduce Co-NavGPT, a novel framework that integrates a Vision Language Model (VLM) as a global planner to enable common-sense multi-robot visual target navigation. Co-NavGPT aggregates sub-maps from multiple robots with diverse viewpoints into a unified global map, encoding robot states and frontier regions. The VLM uses this information to assign frontiers across the robots, facilitating coordinated and efficient exploration. Experiments on the Habitat-Matterport 3D (HM3D) demonstrate that Co-NavGPT outperforms existing baselines in terms of success rate and navigation efficiency, without requiring task-specific training. Ablation studies further confirm the importance of semantic priors from the VLM. We also validate the framework in real-world scenarios using quadrupedal robots. Supplementary video and code are available at: https://sites.google.com/view/co-navgpt2.

Index terms

Vision-Based Navigation, Multi-Robot Systems, AI-Enabled Robotics
