ICRA 2026

PAGTM: Position and Attention-Guided Token Merging for Efficient Visual Place Recognition

Hongchan Cho, Youngjo Lee, Jinwoo Jang, Seunghan Yu, Euntai Kim

AI summary

PAGTM enables training-free, efficient deployment of Vision Transformers for Visual Place Recognition by preserving spatial structure and key landmarks during token reduction.

Visual Place Recognition, Vision Transformers, Token Merging, Efficient Inference, Spatial Awareness, Attention Mechanism

Problem

Vision Transformers achieve state-of-the-art Visual Place Recognition (VPR) accuracy, but the cost of self-attention grows quadratically with the number of tokens, which limits real-world deployment. Existing token reduction methods often degrade recognition performance because they overlook spatial layout and semantic importance.
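To make the quadratic cost concrete, the following minimal sketch counts the query-key pairs that one self-attention head scores and compares a reduced token set against the full sequence. The function names and the 16x16 patch grid are assumptions chosen for illustration and do not come from the paper.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by one self-attention head."""
    return num_tokens * num_tokens


def relative_attention_cost(num_tokens: int, kept_tokens: int) -> float:
    """Attention cost over kept_tokens as a fraction of the full-sequence cost."""
    return attention_pair_count(kept_tokens) / attention_pair_count(num_tokens)


# Hypothetical setup: a 16x16 patch grid plus one class token.
full = 16 * 16 + 1
half = full // 2
print(relative_attention_cost(full, half))
```

Under these assumptions, keeping about half of the tokens leaves roughly a quarter of the pairwise attention work, which is why token reduction is attractive for deployment.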

Approach

PAGTM is a training-free inference-time framework that merges tokens according to feature similarity and 2D positional proximity, while protecting high-attention tokens from merging, so that scene geometry and critical landmarks are preserved.
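The three cues can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the alternating source/destination split follows the general bipartite matching idea used in ToMe, and the way the cues are combined here (cosine similarity minus a weighted grid distance, with the top-attention tokens excluded from merging) is an assumption, as are the names `merge_tokens`, `pos_weight`, and `num_protected`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def merge_tokens(feats, positions, attn, r, num_protected, pos_weight=0.1):
    """Merge up to r token pairs and return a list of (feature, position).

    feats: feature vectors; positions: (row, col) grid coordinates;
    attn: one attention score per token. All inputs are hypothetical.
    """
    n = len(feats)
    # Attention-based protection: the highest-attention tokens never merge.
    by_attn = sorted(range(n), key=lambda i: attn[i], reverse=True)
    protected = set(by_attn[:num_protected])
    # Alternating split into sources (even) and destinations (odd).
    src = [i for i in range(n) if i % 2 == 0 and i not in protected]
    dst = [i for i in range(n) if i % 2 == 1 and i not in protected]

    def score(i, j):
        # Assumed combination: similarity rewarded, grid distance penalized.
        return cosine(feats[i], feats[j]) - pos_weight * math.dist(positions[i], positions[j])

    candidates = []
    if dst:
        for i in src:
            best = max(dst, key=lambda j: score(i, j))
            candidates.append((score(i, best), i, best))
    candidates.sort(reverse=True)
    merged_into = {i: j for _, i, j in candidates[:r]}

    groups = {}  # destination index -> all tokens averaged into it
    for i, j in merged_into.items():
        groups.setdefault(j, [j]).append(i)

    out = []
    for k in range(n):
        if k in merged_into:
            continue  # this token was absorbed by its destination
        members = groups.get(k, [k])
        feat = [sum(feats[m][d] for m in members) / len(members) for d in range(len(feats[k]))]
        pos = (sum(positions[m][0] for m in members) / len(members),
               sum(positions[m][1] for m in members) / len(members))
        out.append((feat, pos))
    return out
```

In this sketch a merged token takes the plain average of its members' features and positions; a size-weighted average would be a natural alternative when merging is applied across several layers.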

Key results

Training-free token merging framework tailored for ViT-based VPR
Integrates positional proximity and attention-aware protection to preserve scene structure and key landmarks
Consistently outperforms ToMe and ToFu across five VPR datasets and three model architectures
Achieves superior accuracy-efficiency trade-offs, even surpassing full-token baselines under high compression

Why it matters

Enables efficient deployment of high-performance Vision Transformers for real-time robotics and autonomous driving without requiring model retraining.

Abstract

Recent advances in Vision Transformers (ViTs) have significantly improved the performance of Visual Place Recognition (VPR), but their high computational cost, due to the quadratic complexity of self-attention, limits their practical deployment in real-world scenarios. To address this challenge, we propose PAGTM (Positional- and Attention-Guided Token Merging), a training-free token reduction framework designed specifically for ViT-based VPR models. In VPR, preserving the spatial layout of a scene (e.g., road alignment, building structures) and focusing on semantically meaningful regions are both critical for robust matching under viewpoint and appearance variations. However, existing token reduction methods often overlook these aspects, leading to degraded recognition performance. PAGTM therefore incorporates two key cues. The first is positional proximity, which merges spatially adjacent tokens to maintain the scene’s structural layout. The second is attention-based token protection, which retains tokens that receive high attention because they represent regions important for distinguishing places, such as signs or distinctive structures. Without requiring any fine-tuning, PAGTM can be directly applied at inference time and consistently outperforms existing token reduction methods such as ToMe and ToFu across multiple ViT-based VPR models and datasets, achieving a better trade-off between computational efficiency and retrieval accuracy.

Index terms

Deep Learning for Visual Perception, Recognition, Localization