TeTRA-VPR: A Ternary Transformer Approach for Compact Visual Place Recognition
Oliver Edward Grainge, Michael J Milford, Indu Bodala, Sarvapali Ramchurn, Shoaib Ehsan
AI summary
Problem
Large Vision Transformer models for Visual Place Recognition (VPR) exceed the memory and compute budgets of resource-constrained platforms like drones and mobile robots, while existing compression methods often cause significant accuracy degradation.
Approach
TeTRA employs a two-stage training pipeline that progressively quantizes a ViT backbone to ternary precision and its final embedding layer to binary, using multi-level knowledge distillation from a full-precision teacher to stabilize training and maintain representational power.
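As a rough illustration of what ternary precision means for a weight matrix, the sketch below maps each weight to a code in {-1, 0, +1} multiplied by one shared scale. The absmean scaling rule and the names `ternarize` and `dequantize` are assumptions made for illustration; this summary does not specify TeTRA's exact quantizer, and the progressive schedule and distillation losses are not shown.

```python
def ternarize(weights, eps=1e-8):
    """Quantize a list of floats to codes in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (scale = mean |w|), a common choice for
    ternary weights; not necessarily the quantizer used by TeTRA.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate full-precision weights from ternary codes."""
    return [c * scale for c in codes]


# Made-up weights, for illustration only.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes)  # [1, 0, 1, -1, 0, -1]
```

Each code takes one of three values, so it fits in 2 bits, which is consistent with the 2-bit backbone precision quoted in the abstract.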
Key results
- Up to 69% reduction in memory consumption
- 35% lower inference latency
- Maintains or improves Recall@1 accuracy on standard benchmarks
- Enables high-accuracy VPR on power-limited robotic hardware
Why it matters
Delivers a Pareto-optimal balance of efficiency and accuracy, making advanced transformer-based localization viable for real-world deployment on drones and mobile robots.
Abstract
Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
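To make the matching step concrete, the sketch below binarizes real-valued descriptors by sign and retrieves the closest reference by Hamming distance. Packing bits into Python integers, the helper names (`binarize`, `hamming`, `retrieve`), and the toy descriptors are all illustrative assumptions; the abstract does not describe how TeTRA performs retrieval.

```python
def binarize(embedding):
    """Pack the sign bits of a real-valued descriptor into one integer."""
    bits = 0
    for value in embedding:
        bits = (bits << 1) | (1 if value >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed descriptors."""
    return bin(a ^ b).count("1")


def retrieve(query, database):
    """Index of the database descriptor closest to the query."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))


# Toy 8-dimensional descriptors (made-up values, for illustration only).
references = [
    [0.3, -0.2, 0.8, -0.5, 0.1, -0.9, 0.4, -0.1],
    [-0.6, 0.7, -0.3, 0.2, -0.8, 0.5, -0.4, 0.9],
    [0.2, 0.6, -0.1, -0.7, 0.3, 0.4, -0.2, -0.5],
]
query = [0.4, -0.1, 0.6, -0.3, -0.2, -0.7, 0.5, -0.3]

database = [binarize(r) for r in references]
best = retrieve(binarize(query), database)
print(best)  # 0
```

A binary descriptor with D dimensions occupies D bits, compared with 32·D bits for the same descriptor stored as float32, and the distance computation reduces to an XOR followed by a bit count.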