IROS 2024

Multiple Visual Features in Topological Map for Vision-And-Language Navigation

Ruonan Liu, Ping Kong, Weidong Zhang

Abstract

Vision-and-Language Navigation (VLN) in continuous environments aims to navigate robot agents in unseen environments following natural language instructions. The majority of existing approaches rely on constructing semantic maps or topological maps to record information. However, semantic maps overlook the detailed information of objects and the correspondence among views during navigation, while topological maps lack the spatial representation between entities. To address these limitations, we propose a novel visual feature representation method for continuous VLN, called Multiple Visual Features in Topological Map (MV-Topo). MV-Topo utilizes three distinct visual encoders to extract visual features, which are integrated in the dynamically generated topological map. These fused features actively participate in the subsequent cross-modal planning to derive a long-term path towards a subgoal, effectively guiding the agent to reach the final location. We experimentally demonstrate the effectiveness of our approach and achieve competitive results on the full VLN-CE test splits. Notably, our method outperforms the state-of-the-art by 3.5% in terms of the Navigation Error (NE) metric, indicating that the utilization of multiple visual features significantly enhances the agent’s perception of semantic targets.

Index terms

Vision-Based Navigation, Agent-Based Systems