ICRA 2026

Structured Pruning for Efficient Visual Place Recognition

Oliver Edward Grainge, Michael J Milford, Indu Bodala, Sarvapali Ramchurn, Shoaib Ehsan

AI summary

Structured pruning cuts VPR memory and latency by over 15% with negligible accuracy loss, enabling real-time edge deployment.

Visual Place Recognition, Structured Pruning, Edge Computing, Model Compression, Real-time Localization, Robotics

Problem

Real-time Visual Place Recognition on edge devices is hindered by redundancy in both the network architecture and the feature descriptors of current CNN-based methods, which makes it hard to improve efficiency without sacrificing accuracy.

Approach

The method applies iterative structured pruning to simultaneously compress convolutional backbones and reduce feature descriptor dimensions across multiple VPR architectures.
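As a rough illustration of what iterative structured pruning involves, the sketch below removes whole filters from a toy layer over several rounds. It is not the authors' method: the L1-norm importance score, the per-round fraction, and the names `prune_step`, `iterative_prune`, and `fine_tune` are all assumptions made for illustration. In many VPR networks, removing output channels of the final layer would also shrink the descriptor, which is the second target mentioned above.

```python
def l1_norm(filter_weights):
    """Sum of absolute weights: an assumed importance score for one filter."""
    return sum(abs(w) for w in filter_weights)


def prune_step(filters, fraction):
    """Drop the lowest-scoring `fraction` of filters, always keeping at least one."""
    n_remove = min(int(len(filters) * fraction), len(filters) - 1)
    # Filter indices sorted from least to most important.
    order = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    dropped = set(order[:n_remove])
    # Surviving filters keep their original relative order.
    return [f for i, f in enumerate(filters) if i not in dropped]


def iterative_prune(filters, fraction, rounds, fine_tune=None):
    """Alternate pruning with an optional (hypothetical) fine-tuning callback."""
    for _ in range(rounds):
        filters = prune_step(filters, fraction)
        if fine_tune is not None:
            filters = fine_tune(filters)
    return filters


# Toy layer: 8 filters with 3 weights each (made-up values).
toy_layer = [
    [0.9, -0.8, 0.7], [0.01, 0.02, -0.01], [0.5, 0.4, -0.6], [0.05, -0.04, 0.03],
    [1.2, -1.1, 0.9], [0.3, 0.2, -0.1], [0.02, 0.01, 0.0], [0.7, -0.6, 0.5],
]
pruned = iterative_prune(toy_layer, fraction=0.25, rounds=2)
print(len(toy_layer), "->", len(pruned))  # prints "8 -> 5"
```

Pruning a little at a time, with recovery training between rounds, is the usual reason for making the procedure iterative: each round removes only the filters that currently matter least.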

Key results

21% average reduction in model and map memory usage
16% decrease in feature extraction and retrieval latency
Less than 1% drop in recall@1 accuracy
Validated on Nvidia Xavier NX embedded hardware
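For readers unfamiliar with the metric, recall@1 is commonly computed as the fraction of queries whose single nearest map descriptor is a correct match. The sketch below assumes squared Euclidean distance and a set of acceptable map indices per query; the function names and toy data are hypothetical and are not taken from the paper's evaluation code.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recall_at_1(query_descs, map_descs, ground_truth):
    """Fraction of queries whose nearest map descriptor is a correct match.

    ground_truth[i] is the set of map indices accepted as correct for query i.
    """
    hits = 0
    for q_idx, query in enumerate(query_descs):
        # Brute-force nearest-neighbour search over the whole map.
        best = min(range(len(map_descs)),
                   key=lambda m: squared_l2(query, map_descs[m]))
        if best in ground_truth[q_idx]:
            hits += 1
    return hits / len(query_descs)


# Toy 2-D descriptors (made-up values).
map_descs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query_descs = [[0.1, 0.0], [0.9, 0.1], [0.2, 0.9]]
ground_truth = [{0}, {1}, {0}]  # the third query is retrieved incorrectly

score = recall_at_1(query_descs, map_descs, ground_truth)
print(round(score, 3))  # prints 0.667
```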

Why it matters

Enables efficient, real-time place recognition on low-power robotic and mobile edge devices while keeping the loss in recall@1 below 1%.

Abstract

Visual Place Recognition (VPR) is fundamental for the global re-localization of robots and devices, enabling them to recognize previously visited locations based on visual inputs. This capability is crucial for maintaining accurate mapping and localization over large areas. Because VPR methods need to operate in real time on embedded systems, it is critical to optimize them for minimal resource consumption. While the most efficient VPR approaches employ standard convolutional backbones with fixed descriptor dimensions, these often lead to redundancy in the embedding space as well as in the network architecture. Our work introduces a novel structured pruning method that not only streamlines common VPR architectures but also strategically removes redundancies within the feature embedding space. This dual focus significantly enhances the efficiency of the system, reducing both map and model memory requirements and decreasing feature extraction and retrieval latencies. Our approach reduces memory usage and latency by 21% and 16%, respectively, across models, while lowering recall@1 by less than 1%. This improvement benefits real-time applications on edge devices with negligible accuracy loss.
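To see why descriptor dimension matters for map memory, note that a map storing one dense descriptor per place occupies roughly (number of places) × (descriptor dimension) × (bytes per value). The sketch below works through that arithmetic. Every number in it (10,000 places, 2048 and 1536 dimensions, 4-byte floats) is a hypothetical chosen for illustration, not a figure from the paper.

```python
def map_memory_bytes(num_places, descriptor_dim, bytes_per_value=4):
    """Memory needed to store one dense descriptor per mapped place."""
    return num_places * descriptor_dim * bytes_per_value


def relative_reduction(before, after):
    """Fractional saving when going from `before` to `after`."""
    return (before - after) / before


# Hypothetical map: 10,000 places stored as float32 descriptors.
dense_map = map_memory_bytes(10_000, 2048)
pruned_map = map_memory_bytes(10_000, 1536)

print(dense_map)   # prints 81920000
print(pruned_map)  # prints 61440000
print(relative_reduction(dense_map, pruned_map))  # prints 0.25
```

Brute-force retrieval compares a query against every stored descriptor, so its cost scales with the same product, which is why shrinking the descriptor can lower retrieval latency as well as memory.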

Index terms

Deep Learning for Visual Perception, Recognition, Localization