ICRA 2026

TeTRA-VPR: A Ternary Transformer Approach for Compact Visual Place Recognition

Oliver Edward Grainge, Michael J Milford, Indu Bodala, Sarvapali Ramchurn, Shoaib Ehsan


AI summary

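TeTRA compresses a Vision Transformer through progressive distillation, quantizing the backbone to ternary precision and binarizing the final embedding layer. Compared to efficient baselines, it cuts memory by up to 69% and inference latency by 35% while preserving or slightly improving place recognition accuracy, which suits resource-constrained robots.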
Visual Place Recognition, Vision Transformer, Ternary Quantization, Knowledge Distillation, Model Compression, Robotics

Problem

Large Vision Transformer models for Visual Place Recognition (VPR) exceed the memory and compute budgets of resource-constrained platforms like drones and mobile robots, while existing compression methods often cause significant accuracy degradation.

Approach

TeTRA employs a two-stage training pipeline that progressively quantizes a ViT backbone to ternary (2-bit) precision and binarizes its final embedding layer, using multi-level knowledge distillation from a full-precision teacher to stabilize training and maintain representational power.
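
The summary does not spell out the quantizer or the progression schedule, so the following is only a minimal sketch of what ternary weight quantization with a gradual hand-over can look like. It assumes an absmean per-tensor scale (a rule used in other ternary-weight work, not confirmed for TeTRA) and a hypothetical blending factor `lam` that ramps from 0 to 1 during training. The function names `ternarize` and `progressive_weights` are illustrative.

```python
import numpy as np


def ternarize(w, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} plus one per-tensor scale.

    Assumption: the scale is the mean absolute weight (absmean rule).
    """
    scale = np.abs(w).mean() + eps
    codes = np.clip(np.rint(w / scale), -1, 1)
    return codes, scale


def progressive_weights(w, lam):
    """Blend full-precision and ternary weights.

    lam = 0 returns the full-precision weights, lam = 1 the fully
    ternarized ones. The linear blend is a hypothetical schedule.
    """
    codes, scale = ternarize(w)
    return (1.0 - lam) * w + lam * codes * scale


# Usage: halfway through a hypothetical quantization ramp.
w = np.array([0.9, -0.05, -1.2, 0.3])
w_half = progressive_weights(w, lam=0.5)
```

In a real training loop the rounding step would also need a gradient workaround such as a straight-through estimator, which this sketch leaves out.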

Key results

  • Up to 69% reduction in memory consumption compared to efficient baselines
  • 35% lower inference latency
  • Maintains or improves Recall@1 accuracy on standard benchmarks
  • Enables high-accuracy VPR on power-limited robotic hardware
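
To give a sense of scale for low-bit weight storage, here is a rough back-of-the-envelope sketch. The parameter count is a hypothetical stand-in (roughly the size of a ViT-Base backbone), not a figure from the paper, and the calculation covers weight storage only. It does not reproduce the 69% result, which the abstract measures against efficient baselines rather than against a 32-bit copy of the same model.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


num_params = 86_000_000  # hypothetical: roughly a ViT-Base backbone

fp32 = weight_bytes(num_params, 32)    # full-precision floats
ternary = weight_bytes(num_params, 2)  # ternary codes packed into 2 bits

print(f"fp32: {fp32 / 1e6:.1f} MB")      # fp32: 344.0 MB
print(f"2-bit: {ternary / 1e6:.1f} MB")  # 2-bit: 21.5 MB
print(f"ratio: {fp32 / ternary:.0f}x")   # ratio: 16x
```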

Why it matters

Delivers a Pareto-optimal balance of efficiency and accuracy, making advanced transformer-based localization viable for real-world deployment on drones and mobile robots.

Abstract

Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
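
The retrieval step and the recall@1 metric mentioned in the abstract can be illustrated with a small sketch. It assumes that the binarized embedding layer yields binary descriptors that are compared by Hamming distance; the abstract does not state which distance is used, so treat that as an assumption. The helper names and the toy data are hypothetical.

```python
import numpy as np


def binarize(embeddings):
    """Turn real-valued descriptors into 0/1 codes by sign."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)


def hamming_top1(query_bits, db_bits):
    """Index of the nearest database code for each query.

    Distance is the number of differing bits (Hamming distance).
    """
    diffs = query_bits[:, None, :] != db_bits[None, :, :]
    return diffs.sum(axis=2).argmin(axis=1)


def recall_at_1(predicted, ground_truth):
    """Fraction of queries whose top match is a correct reference.

    ground_truth[i] is the set of database indices that count as the
    same place as query i.
    """
    hits = [int(p) in gt for p, gt in zip(predicted, ground_truth)]
    return sum(hits) / len(hits)


# Toy example: three reference descriptors, two queries.
db = binarize([[1, -1, 1, -1], [-1, -1, 1, 1], [1, 1, 1, 1]])
queries = binarize([[0.9, -0.2, 0.8, -0.1], [0.5, 0.7, 0.3, 0.4]])

top1 = hamming_top1(queries, db)
score = recall_at_1(top1, [{0}, {1, 2}])
```

Real systems would typically pack the bits into machine words and use popcount instructions, but the comparison logic is the same.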

Index terms

Localization, Recognition, Deep Learning for Visual Perception
