IROS 2024

Enhanced Language-Guided Robot Navigation with Panoramic Semantic Depth Perception and Cross-Modal Fusion

Liuyi Wang, Jiagui Tang, Zongtao He, Ronghao Dang, Chengju Liu, Qijun Chen

Abstract

Integrating visual observation with linguistic instruction holds significant promise for enhancing robot navigation across unstructured environments and enriches the human-robot interaction experience. However, while panoramic RGB views furnish robots with extensive environmental visuals, current methods significantly overlook crucial semantic and depth cues. This incomplete representation may lead to misinterpretation or inadequate execution of language instructions, thereby impeding navigation performance and adaptability. In this paper, we introduce SEAT, a semantic-depth aware cross-modal transformer model. Our approach incorporates an efficient panoramic multi-type visual encoder to capture comprehensive environmental details. To mitigate the rigidity of feature mapping stemming from the freezing of pre-training encoders, we propose a novel region query pre-training task. Additionally, we leverage an improved dual-scale cross-modal transformer to facilitate the integration of instructions, topological memory, and action prediction. Extensive experiments on three language-guided robot navigation datasets demonstrate the efficacy of our model, achieving competitive navigation success rates with fewer parameters and computational load. Furthermore, we validate SEAT’s effectiveness in real-world scenarios by deploying it on a mobile robot across various environments. The code is available at https://github.com/CrystalSixone/SEAT.

Index terms

Vision-Based Navigation; Multi-Modal Perception for HRI; Deep Learning for Visual Perception