Dilated Superpixel Aggregation for Visual Place Recognition
Zichao Zeng, June Moh Goo, Jan Boehm
AI summary
Problem
Existing segment-level VPR methods rely on computationally heavy semantic segmenters that discard valuable pixel information and struggle with viewpoint and scale variations.
Approach
The method replaces heavy semantic segmentation models with lightweight superpixel clustering, then expands the resulting superpixels into dilated superpixels over which DINOv2 features are aggregated into compact, multi-scale descriptors.
Key results
- Outperforms SOTA segment-based VPR methods on 12 diverse benchmarks
- Significantly reduces computational latency while preserving complete pixel information
- Enhances robustness to viewpoint, scale, and seasonal environmental changes
- Delivers high accuracy without requiring model fine-tuning
Why it matters
Enables resource-constrained robotic systems to achieve high-performance, robust localization across diverse real-world environments.
Abstract
Visual Place Recognition (VPR) is a fundamental task in robotics and computer vision, enabling systems to identify locations seen in the past using visual information. The previous state-of-the-art approach focuses on encoding and retrieving semantically meaningful supersegment representations of images to significantly enhance recognition recall rates. However, we find that such representations struggle to cope with significant variations in viewpoint and scale, as well as scenes with sparse or limited information. Furthermore, these semantic-driven supersegment representations often exclude semantically meaningless yet valuable pixel information. In this work, we present Sel-V and MuSSel-V, two efficient variants within the segment-level VPR paradigm that replace heavy and fragmented supersegments with lightweight, visually compact and complete dilated superpixels for local feature aggregation. The use of superpixels preserves pixel-level details while reducing computational overhead. A multi-scale extension further enhances robustness to viewpoint and scale changes. Comprehensive experiments on twelve public benchmarks show that our approach achieves a better trade-off between accuracy and efficiency than existing segment-based methods. These results demonstrate that lightweight, non-semantic segmentation can serve as an effective alternative for high-performance, resource-efficient visual place recognition in robotics.

Index Terms: Localization, vision-based navigation, visual place recognition (VPR), superpixel, aggregation.
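To make the idea of aggregating local features over dilated superpixels concrete, the following is a minimal sketch and not the authors' implementation. It assumes a dense feature map (random values here, standing in for DINOv2 features), a superpixel label map (a regular grid here, standing in for a real superpixel algorithm), square-window dilation, and mean pooling followed by L2 normalisation. The function names, the `radius` parameter, and the pooling choice are all hypothetical.

```python
import numpy as np


def dilate_mask(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2*radius+1) x (2*radius+1) square window."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def aggregate_dilated_superpixels(
    features: np.ndarray, labels: np.ndarray, radius: int = 1
) -> np.ndarray:
    """Pool features over each dilated superpixel.

    features: (H, W, D) dense feature map (assumed already extracted).
    labels:   (H, W) integer superpixel ids.
    Returns a (K, D) array with one L2-normalised descriptor per superpixel.
    Mean pooling is an assumption made for this sketch.
    """
    ids = np.unique(labels)
    descriptors = np.empty((len(ids), features.shape[-1]), dtype=np.float64)
    for k, sp in enumerate(ids):
        # Dilation lets neighbouring superpixels overlap at their borders.
        region = dilate_mask(labels == sp, radius)
        mean = features[region].mean(axis=0)
        descriptors[k] = mean / (np.linalg.norm(mean) + 1e-12)
    return descriptors


# Toy input: an 8x8 map with 4-dimensional features and a 2x2 grid of
# 4x4 blocks standing in for superpixels.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 4))
rows = np.arange(8)[:, None] // 4
cols = np.arange(8)[None, :] // 4
labels = rows * 2 + cols

descriptors = aggregate_dilated_superpixels(features, labels, radius=1)
print(descriptors.shape)
```

With `radius=0` the regions reduce to the original, non-overlapping superpixels. Larger radii make each descriptor draw on more surrounding context, at the cost of more overlap between neighbouring regions.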