ICRA 2026

TinyVPR: Distilling Correct and Confusing Knowledge for Lightweight Visual Place Recognition

Zhuochen Yang, Runheng Zuo, Xu Yang, Runjiang Dou, Zhe Wang, Liyuan Liu, Shuangming Yu

AI summary

A confusion-aware contrastive distillation framework enables lightweight student models to learn both correct matches and hard negative boundaries, achieving state-of-the-art accuracy-per-parameter for edge-deployable visual place recognition.

Visual Place Recognition, Knowledge Distillation, Lightweight Models, Contrastive Learning, Edge Deployment, Tiny-ViT

Problem

Heavy deep models dominate Visual Place Recognition but are impractical for edge devices, while existing compression and distillation techniques fail to robustly handle hard negative distractors in complex urban scenes.

Approach

The method uses an online positive-negative contrastive distillation framework with a cross-attention alignment module and a confusion-aware Multi-Similarity loss to simultaneously transfer correct associations and confusing negative boundaries from a teacher to a lightweight student model.
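To make the loss family concrete, the sketch below implements the standard Multi-Similarity loss with its usual hard-pair mining step, written in plain Python. It is not the paper's confusion-aware variant, whose exact form is not given in this summary, and it omits the teacher entirely. The function name, the hyperparameters (alpha, beta, lam, eps), and their default values are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5, eps=0.1):
    """Standard Multi-Similarity loss over a batch (hypothetical helper).

    embeddings: list of descriptor vectors, labels: place id per vector.
    All hyperparameter defaults are illustrative assumptions.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        pos = [cosine(embeddings[i], embeddings[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [cosine(embeddings[i], embeddings[j])
               for j in range(n) if labels[j] != labels[i]]
        if not pos or not neg:
            continue
        # Mining: keep negatives that are nearly as similar as the weakest
        # positive, and positives that are nearly as dissimilar as the
        # strongest negative. These are the "confusing" pairs.
        hard_neg = [s for s in neg if s > min(pos) - eps]
        hard_pos = [s for s in pos if s < max(neg) + eps]
        if hard_pos:
            total += math.log(1.0 + sum(math.exp(-alpha * (s - lam)) for s in hard_pos)) / alpha
        if hard_neg:
            total += math.log(1.0 + sum(math.exp(beta * (s - lam)) for s in hard_neg)) / beta
    return total / n


# Two places, two views each. In the first batch the places are well separated;
# in the second, each view looks more like the other place than its own.
easy = multi_similarity_loss([[1, 0], [1, 0.01], [0, 1], [0.01, 1]], [0, 0, 1, 1])
hard = multi_similarity_loss([[1, 0], [0, 1], [1, 0.1], [0.1, 1]], [0, 0, 1, 1])
print(easy, hard)
```

In the well-separated batch the mining step discards every pair, so only the batch containing confusing negatives contributes to the loss. A distillation setup would additionally need some way to compare student and teacher similarities, which this sketch does not attempt.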

Key results

Introduces a confusion-aware contrastive distillation strategy leveraging online hard negative mining
Designs a cross-attention feature alignment module to bridge teacher-student representation gaps
Achieves over 5× parameter reduction while maintaining competitive Recall@1 on Pitts30k and MSLS benchmarks
Delivers superior accuracy-per-parameter metrics compared to existing lightweight VPR baselines
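For readers unfamiliar with the Recall@1 metric mentioned above, the following sketch shows how Recall@K is commonly computed in place recognition: a query counts as correct if any of its top-K retrieved database images is a ground-truth match. The function name and the toy descriptors are hypothetical, and how the ground-truth sets are built (typically a distance threshold around the query location) is assumed rather than taken from the paper.

```python
def recall_at_k(query_desc, db_desc, positives, k=1):
    """Fraction of queries with at least one true match in the top-k results.

    query_desc, db_desc: lists of descriptor vectors (assumed L2-comparable).
    positives: for each query, a set of database indices that are true matches.
    """
    hits = 0
    for q, gt in zip(query_desc, positives):
        # Rank database entries by dot-product similarity, highest first.
        sims = [sum(a * b for a, b in zip(q, d)) for d in db_desc]
        ranked = sorted(range(len(db_desc)), key=lambda j: sims[j], reverse=True)
        if any(j in gt for j in ranked[:k]):
            hits += 1
    return hits / len(query_desc)


# Toy example: three database images, two queries.
db = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
ground_truth = [{0}, {2}]

print(recall_at_k(queries, db, ground_truth, k=1))
print(recall_at_k(queries, db, ground_truth, k=2))
```

The second query retrieves a wrong image first and its true match second, so it only counts as a hit once K is raised to 2.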

Why it matters

Provides a practical, high-efficiency solution for deploying robust visual localization on resource-constrained edge platforms such as autonomous vehicles and AR headsets.

Abstract

Visual Place Recognition (VPR) is a key technology in autonomous driving, robotics, and augmented reality, requiring efficient and robust localization in large-scale environments. However, most existing methods rely on heavy deep models that are computationally expensive and difficult to deploy on edge devices, limiting their practical use. While model compression techniques such as compact model fine-tuning and traditional knowledge distillation have shown some promise, they often fall short in visual retrieval tasks. Inspired by the teaching principle that emphasizes both reinforcing correct knowledge and correcting errors, we propose an online positive-negative sample contrastive distillation framework. This approach enables the student model to learn more discriminative features by simultaneously distilling the relationships among positive and negative samples. We also design a cross-attention based feature alignment operator to better align intermediate feature representations between teacher and student models after feature extraction, improving feature consistency and distillation efficiency. Experimental results demonstrate that our method achieves a favorable trade-off between accuracy and efficiency on multiple visual localization benchmarks, significantly outperforming existing lightweight approaches in several scenarios. These advantages make it well-suited for deployment on resource-constrained edge devices.
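As a rough illustration of what a cross-attention alignment step computes, the sketch below applies plain scaled dot-product attention between two token sets. It is only a sketch under assumptions: the abstract does not say which model supplies the queries, so treating student tokens as queries and teacher tokens as keys and values is a guess, and the learned projection layers a real operator would use are omitted. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: tokens from one model (assumed here: the student).
    keys, values: tokens from the other model (assumed here: the teacher).
    Queries and keys must share a dimension; the output has one row per
    query token and the width of the value vectors.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


student_tokens = [[1.0, 0.0], [0.0, 1.0]]                 # 2 tokens, dim 2
teacher_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # 3 tokens, dim 2
teacher_values = [[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]]                   # 3 tokens, dim 4

aligned = cross_attention(student_tokens, teacher_keys, teacher_values)
print(len(aligned), len(aligned[0]))
```

Each output row is a weighted average of the value vectors, so the result keeps the query token count while living in the value feature space. That property is what lets features from models with different token layouts be compared after alignment.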

Index terms

Visual Learning, Vision-Based Navigation, Intelligent Transportation Systems