PAGTM: Position and Attention-Guided Token Merging for Efficient Visual Place Recognition
Hongchan Cho, Youngjo Lee, Jinwoo Jang, Seunghan Yu, Euntai Kim
AI summary
Problem
Vision Transformers achieve state-of-the-art performance in Visual Place Recognition (VPR), but the quadratic computational complexity of self-attention limits their real-world deployment. Existing token reduction methods degrade recognition performance because they ignore spatial layout and semantic importance.
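As a rough illustration of why token count matters, the sketch below estimates the cost of the two token-by-token matrix products in a single self-attention head. The helper name `attention_flops` and the token counts and width are illustrative assumptions, not figures from the paper, and the estimate ignores the projection and MLP layers, whose cost grows only linearly with the number of tokens.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for Q @ K^T plus (attention weights) @ V.

    Each of the two products costs about num_tokens^2 * dim multiply-adds,
    which is the quadratic term in self-attention.
    """
    return 2 * num_tokens ** 2 * dim


# Hypothetical example: halving the number of tokens at a fixed width.
full = attention_flops(256, 768)
half = attention_flops(128, 768)
print(half / full)  # 0.25
```

Under this simplified model, removing half of the tokens cuts the quadratic part of the attention cost to a quarter.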
Approach
PAGTM is a training-free inference-time framework that merges tokens based on feature similarity, 2D positional proximity, and attention-based protection to preserve scene geometry and critical landmarks.
Key results
- Training-free token merging framework tailored for ViT-based VPR
- Integrates positional proximity and attention-aware protection to preserve scene structure and key landmarks
- Consistently outperforms ToMe and ToFu across five VPR datasets and three model architectures
- Achieves superior accuracy-efficiency trade-offs, even surpassing full-token baselines under high compression
Why it matters
Enables efficient deployment of high-performance Vision Transformers for real-time robotics and autonomous driving without requiring model retraining.
Abstract
Recent advances in Vision Transformers (ViTs) have significantly improved the performance of Visual Place Recognition (VPR), but their high computational cost, which stems from the quadratic complexity of self-attention, limits their practical deployment in real-world scenarios. To address this challenge, we propose PAGTM (Positional- and Attention-Guided Token Merging), a training-free token reduction framework designed specifically for ViT-based VPR models. In VPR, preserving the spatial layout of a scene (e.g., road alignment, building structures) and focusing on semantically meaningful regions are both critical for robust matching under viewpoint and appearance variations. However, existing token reduction methods often overlook these aspects, leading to degraded recognition performance. PAGTM therefore incorporates two key cues. The first is positional proximity, which merges spatially adjacent tokens to maintain the scene’s structural layout. The second is attention-based token protection, which retains tokens that receive high attention because they represent regions important for distinguishing places, such as signs or distinctive structures. Without requiring any fine-tuning, PAGTM can be applied directly at inference time and consistently outperforms existing token reduction methods such as ToMe and ToFu across multiple ViT-based VPR models and datasets, achieving a better trade-off between computational efficiency and retrieval accuracy.
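The two cues can be sketched on top of a bipartite matching step in the style of ToMe. This is a minimal illustration, not the authors' implementation: the abstract does not say how the cues are combined, so the additive score `sim - lam * dist`, the parameters `lam` and `protect_ratio`, the even/odd token split, and the function name `merge_tokens` are all assumptions made for this example.

```python
import numpy as np


def merge_tokens(x, pos, attn, r, lam=0.5, protect_ratio=0.1):
    """Merge up to r tokens (hypothetical sketch, not the paper's algorithm).

    x:    (N, D) token features
    pos:  (N, 2) patch-grid coordinates of each token
    attn: (N,)   attention each token receives (e.g., from a class token)
    """
    n = x.shape[0]

    # Attention-based protection: the most-attended tokens are left untouched.
    n_protect = int(np.ceil(protect_ratio * n))
    protected = np.zeros(n, dtype=bool)
    protected[np.argsort(-attn)[:n_protect]] = True

    # Bipartite split: even-indexed tokens may be merged into odd-indexed ones.
    src = np.arange(0, n, 2)
    dst = np.arange(1, n, 2)

    # Feature cue: cosine similarity between every source and destination.
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = xn[src] @ xn[dst].T

    # Positional cue (assumed form): penalize pairs that are far apart in 2D.
    dist = np.linalg.norm(pos[src][:, None, :] - pos[dst][None, :, :], axis=-1)
    dist = dist / (dist.max() + 1e-8)
    score = sim - lam * dist

    # Protected tokens can neither be merged away nor absorb other tokens.
    score[protected[src], :] = -np.inf
    score[:, protected[dst]] = -np.inf

    best_dst = score.argmax(axis=1)
    best_score = score.max(axis=1)
    r = min(r, int(np.isfinite(best_score).sum()))
    merged = np.argsort(-best_score)[:r]  # sources with the strongest match

    # Average each merged source into its chosen destination.
    sums = x[dst].copy()
    counts = np.ones(len(dst))
    np.add.at(sums, best_dst[merged], x[src[merged]])
    np.add.at(counts, best_dst[merged], 1)

    kept = np.setdiff1d(np.arange(len(src)), merged)
    return np.concatenate([x[src[kept]], sums / counts[:, None]], axis=0)


# Usage on a hypothetical 4x4 patch grid with random features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
grid = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij"), axis=-1)
reduced = merge_tokens(feats, grid.reshape(-1, 2).astype(float), rng.random(16), r=4)
print(reduced.shape)  # (12, 8)
```

In this sketch the positional penalty steers merges toward neighboring patches, and protected tokens pass through with their features unchanged. A real implementation would also need to track merged token sizes and positions across layers, which is omitted here.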