IROS 2024

Multimodal Evolutionary Encoder for Continuous Vision-Language Navigation

Zongtao He, Liuyi Wang, Lu Chen, Shu Li, Qingqing Yan, Chengju Liu, Qijun Chen


Abstract

Can a multimodal encoder evolve when facing increasingly tough circumstances? Our work investigates this possibility in the context of continuous vision-language navigation (continuous VLN), which aims to navigate robots under linguistic supervision and visual feedback. We propose a multimodal evolutionary encoder (MEE) comprising a unified multimodal encoder architecture and an evolutionary pre-training strategy. The unified multimodal encoder integrates rich modalities, including depth and sub-instructions, to strengthen its understanding of environments and tasks. It also effectively utilizes monocular observation, reducing the reliance on panoramic vision. The evolutionary pre-training strategy exposes the encoder to increasingly unfamiliar data domains and difficult objectives. This multi-stage adaptation helps the encoder establish robust intra- and inter-modality connections and improve its generalization to unfamiliar environments. To achieve such evolution, we collect a large-scale multi-stage dataset with specialized objectives, addressing the absence of suitable continuous VLN pre-training. Evaluation on VLN-CE demonstrates the superiority of MEE over other direct action-predicting methods. Furthermore, we deploy MEE in real scenes using self-developed service robots, showcasing its effectiveness and potential for real-world applications. Our code and dataset are available at https://github.com/RavenKiller/MEE.

Index terms

Vision-Based Navigation; Multi-Modal Perception for HRI; Representation Learning