Pair-VPR: Place-Aware Pre-Training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers
Stephen Hausler, Peyman Moghadam
AI summary
Problem
Existing VPR methods typically rely on generic pre-trained weights and single-stage descriptor matching, an approach that often fails to rank the true place match first among the candidates, especially under large viewpoint and temporal variations.
Approach
The authors propose a two-stage Vision Transformer pipeline that uses place-aware Siamese masked image modeling for pre-training, followed by joint fine-tuning to simultaneously learn a global descriptor and a contrastive pair classifier for re-ranking.
Key results
- State-of-the-art Recall@1 across five VPR benchmarks with ViT-B
- 100% Recall@1 on Tokyo24/7 using larger ViT-L/G encoders
- Novel place-aware sampling strategy for pre-training on 5.5M+ images
- Effective two-stage inference pipeline that refines matches via a learned pair classifier
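Recall@N, the metric quoted in these results, is the fraction of queries for which at least one of the top N retrieved database images is a true match. A minimal sketch of that computation follows; the function name `recall_at_n` and the toy rankings are illustrative assumptions and are not taken from the paper.

```python
def recall_at_n(ranked_indices, ground_truth, n=1):
    """Fraction of queries with at least one true match in the top n results.

    ranked_indices: per query, a list of database indices, best match first.
    ground_truth:   per query, a set of database indices that are true matches.
    """
    hits = 0
    for ranking, positives in zip(ranked_indices, ground_truth):
        # A query counts as a hit if any of its top-n candidates is a true match.
        if any(idx in positives for idx in ranking[:n]):
            hits += 1
    return hits / len(ground_truth)


# Toy data (hypothetical): three queries, each with a ranked candidate list.
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 3]]
truth = [{3}, {1}, {0, 2}]

r1 = recall_at_n(rankings, truth, n=1)  # queries 0 and 2 hit at rank 1
r3 = recall_at_n(rankings, truth, n=3)  # every query hits within the top 3
```

In this toy case the second query finds its true match only at rank 3, which is the kind of error a re-ranking stage is meant to correct.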
Why it matters
Provides a scalable, transformer-based VPR framework that significantly improves localization accuracy for autonomous navigation and robotics applications.
Abstract
In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier can predict whether a given pair of images are from the same place or not. The network only comprises Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, typically the network is initialized using pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy, by using Siamese Masked Image Modeling as a pre-training task. We propose a place-aware image sampling procedure from a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Masked Image Modeling encoder and decoder weights in the second stage of training, Pair-VPR can achieve state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders.
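The retrieve-then-re-rank idea described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `retrieve_top_k`, `rerank`, and `toy_pair_score` are hypothetical names, cosine similarity is assumed for the first stage, and `toy_pair_score` is a hand-written placeholder standing in for the learned Vision Transformer pair classifier. For brevity the same 2-D vectors serve as both the global descriptors and the inputs to the pair scorer.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_desc, db_descs, k):
    """Stage 1: shortlist the k database entries most similar to the query."""
    sims = [cosine(query_desc, d) for d in db_descs]
    return sorted(range(len(db_descs)), key=lambda i: -sims[i])[:k]


def rerank(query, db, candidates, pair_score):
    """Stage 2: re-order the shortlist by a pairwise same-place score."""
    return sorted(candidates, key=lambda i: -pair_score(query, db[i]))


def toy_pair_score(a, b):
    """Placeholder for a learned pair classifier (assumption, not the paper's model).

    Higher means "more likely the same place"; here it is just negative distance.
    """
    return -math.dist(a, b)


# Toy database of four descriptors and one query (hypothetical values).
db = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.05]

top = retrieve_top_k(query, db, k=3)             # shortlist from global descriptors
final = rerank(query, db, top, toy_pair_score)   # shortlist after pair scoring
```

In this example the first stage ranks entry 1 highest because it points in the same direction as the query, while the pairwise scorer promotes entry 0 to the top. The point of the second stage is that a scorer which looks at both images jointly can reorder a shortlist that the global descriptor alone ranked imperfectly.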