CAHIR: Co-Attentive Hierarchical Image Representations for Visual Place Recognition
Guohao Peng, Heshan Li, Yifeng Huang, Jun Zhang, Mingxing Wen, Singh Rahul, Danwei Wang
Abstract
Robust visual place recognition (VPR) against significant appearance changes is crucial for the life-long operation of mobile robots. Focusing on this task, we propose a Co-Attentive Hierarchical Image Representations (CAHIR) framework for VPR, which unifies attention-sharing global and local descriptor generation into one encoding pipeline. The hierarchical descriptors are applied to a coarse-to-fine VPR system with global retrieval and local geometric verification. To explore high-quality local matches between task-relevant visual elements, a cross-attention mutual enhancement layer is introduced to strengthen the information interaction between the local descriptors. Through the proposed selective matching distillation, the mutual enhancement layer can learn from state-of-the-art local matchers in a distillation manner. After weighted cross-matching of the enhanced local descriptors, geometric verification is applied to evaluate the spatial consis- tency of the compared image pair. Experiments show CAHIR outperforms the existing global and local representations for VPR in terms of performance and efficiency. Quantitatively, it achieves state-of-the-art results on three city-scale benchmark datasets. Qualitatively, CAHIR proves to attach great impor- tance to task-relevant visual elements and excels at finding local correspondences that are discriminative to the VPR task.