ICRA 2026

Pair-VPR: Place-Aware Pre-Training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers

Stephen Hausler, Peyman Moghadam

AI summary

Pair-VPR achieves state-of-the-art visual place recognition by jointly training a global descriptor and a pair classifier using place-aware pre-training and contrastive learning.

Visual Place Recognition, Vision Transformers, Masked Image Modeling, Contrastive Learning, Re-ranking, Autonomous Navigation

Problem

Existing VPR methods typically rely on generic pre-trained weights and single-stage descriptor matching, which often fails to correctly rank the true place match among candidates, especially under large viewpoint and temporal variations.

Approach

The authors propose a two-stage Vision Transformer pipeline that uses place-aware Siamese masked image modeling for pre-training, followed by joint fine-tuning to simultaneously learn a global descriptor and a contrastive pair classifier for re-ranking.
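The retrieve-then-re-rank idea behind this pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `retrieve_then_rerank`, `cosine`, and the `pair_score` callback are hypothetical names, the descriptors are plain lists of floats, and cosine similarity is assumed for the first stage. In the actual system, the second-stage score would come from the learned pair classifier rather than a lookup.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two descriptors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query_desc: Vector,
    db_descs: List[Vector],
    pair_score: Callable[[int], float],
    top_k: int = 3,
) -> List[int]:
    """Return database indices, best match first.

    Stage 1: shortlist the top_k database images by global-descriptor similarity.
    Stage 2: reorder only that shortlist by a pair score.

    pair_score is a hypothetical stand-in for a learned pair classifier: it
    takes a database index and returns how likely that image shows the same
    place as the query.
    """
    # Stage 1: rank the whole database by descriptor similarity.
    ranked = sorted(
        range(len(db_descs)),
        key=lambda i: cosine(query_desc, db_descs[i]),
        reverse=True,
    )
    candidates = ranked[:top_k]

    # Stage 2: re-rank the shortlist with the (more expensive) pair score.
    return sorted(candidates, key=pair_score, reverse=True)
```

The design point is cost: the cheap descriptor comparison runs against every database image, while the pair score is evaluated only for the `top_k` shortlist.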

Key results

State-of-the-art Recall@1 across five VPR benchmarks with ViT-B
100% Recall@1 on Tokyo24/7 using larger ViT-L/G encoders
Novel place-aware sampling strategy for pre-training on 5.5M+ images
Effective two-stage inference pipeline that refines matches via a learned pair classifier
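For readers unfamiliar with the metric quoted above, Recall@K is the fraction of queries for which at least one true match appears among the top K retrieved candidates. The sketch below assumes each query comes with a ranked list of database indices and a set of ground-truth positives; the function name and data layout are illustrative choices, not taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(
    rankings: Sequence[Sequence[int]],
    ground_truth: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one true match in their top-k results.

    rankings[q] lists database indices for query q, best match first.
    ground_truth[q] is the set of database indices that show the same place.
    """
    if not rankings:
        raise ValueError("need at least one query")
    if len(rankings) != len(ground_truth):
        raise ValueError("rankings and ground_truth must have the same length")

    hits = sum(
        1
        for ranked, positives in zip(rankings, ground_truth)
        if any(idx in positives for idx in ranked[:k])
    )
    return hits / len(rankings)
```

Under this definition, 100% Recall@1 means the top-ranked candidate is a correct match for every query in the benchmark.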

Why it matters

Provides a scalable, transformer-based VPR framework that significantly improves localization accuracy for autonomous navigation and robotics applications.

Abstract

In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier can predict whether a given pair of images are from the same place or not. The network only comprises Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, typically the network is initialized using pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy, by using Siamese Masked Image Modeling as a pre-training task. We propose a Place-aware image sampling procedure from a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Mask Image Modeling encoder and decoder weights in the second stage of training, Pair-VPR can achieve state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders.
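The abstract does not spell out the place-aware sampling procedure, so the following is only one plausible reading: images are grouped by a place label, and each Siamese pre-training pair is drawn from within a single place. The function name, the `(image_id, place_id)` input layout, and the uniform choice over places are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def sample_place_pairs(
    images: Iterable[Tuple[str, Hashable]],
    num_pairs: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample image pairs in which both images come from the same place.

    images yields (image_id, place_id) tuples. Places with fewer than two
    images cannot form a pair and are skipped. Uniform sampling over places
    is an assumption, not a detail from the paper.
    """
    rng = random.Random(seed)

    # Group image ids by the place they depict.
    by_place: Dict[Hashable, List[str]] = defaultdict(list)
    for image_id, place_id in images:
        by_place[place_id].append(image_id)

    eligible = [ids for ids in by_place.values() if len(ids) >= 2]
    if not eligible:
        raise ValueError("no place has two or more images")

    pairs = []
    for _ in range(num_pairs):
        ids = rng.choice(eligible)          # pick a place
        first, second = rng.sample(ids, 2)  # pick two distinct images of it
        pairs.append((first, second))
    return pairs
```

Pairs built this way show the same place under different conditions, which is the kind of input a Siamese masked image modeling task would need in order to learn place-specific features.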

Index terms

Deep Learning for Visual Perception, Recognition, Localization