ICRA 2026

Dilated Superpixel Aggregation for Visual Place Recognition

Zichao Zeng, June Moh Goo, Jan Boehm

AI summary

Lightweight, non-semantic dilated superpixels enable faster and more robust visual place recognition than heavy semantic segmentation methods.

Visual Place Recognition, Dilated Superpixels, Feature Aggregation, DINOv2, Multi-scale VPR, Robotics Localization

Problem

Existing segment-level VPR methods rely on computationally heavy semantic segmenters that discard valuable pixel information and struggle with viewpoint and scale variations.

Approach

The method replaces heavy semantic models with lightweight superpixel clustering and expands them into dilated superpixels to aggregate DINOv2 features into compact, multi-scale descriptors.
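The approach above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `grid_superpixels` is a hypothetical stand-in for a real superpixel clustering step (a regular grid instead of clustering), the dilation uses a square window of assumed radius, mean pooling is an assumed aggregation choice, and the random `features` array stands in for DINOv2 features resampled to image resolution.

```python
import numpy as np

def grid_superpixels(h, w, cell):
    """Hypothetical stand-in for superpixel clustering: one label per cell x cell block."""
    rows = np.arange(h) // cell
    cols = np.arange(w) // cell
    n_cols = (w + cell - 1) // cell
    return rows[:, None] * n_cols + cols[None, :]

def dilate(mask, radius):
    """Binary dilation of a boolean mask with a (2r+1) x (2r+1) square window."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def aggregate(features, labels, radius):
    """Mean-pool (H, W, C) features over each dilated superpixel, then L2-normalize."""
    descriptors = []
    for label in np.unique(labels):
        region = dilate(labels == label, radius)  # superpixel grown into its neighbors
        desc = features[region].mean(axis=0)
        descriptors.append(desc / (np.linalg.norm(desc) + 1e-12))
    return np.stack(descriptors)

# Assumed toy input: random features in place of DINOv2 output.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 16, 8))
labels = grid_superpixels(16, 16, cell=4)
descriptors = aggregate(features, labels, radius=1)
print(descriptors.shape)  # (16, 8): one 8-dimensional descriptor per superpixel
```

Because every pixel belongs to exactly one superpixel before dilation, no pixel is dropped from the aggregation, and dilation lets neighboring regions overlap so each descriptor also sees some surrounding context.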

Key results

Outperforms SOTA segment-based VPR methods on 12 diverse benchmarks
Significantly reduces computational latency while preserving complete pixel information
Enhances robustness to viewpoint, scale, and seasonal environmental changes
Delivers high accuracy without requiring model fine-tuning

Why it matters

Enables resource-constrained robotic systems to achieve high-performance, robust localization across diverse real-world environments.

Abstract

Visual Place Recognition (VPR) is a fundamental task in robotics and computer vision, enabling systems to identify locations seen in the past using visual information. Previous state-of-the-art approaches focus on encoding and retrieving semantically meaningful supersegment representations of images to significantly enhance recognition recall rates. However, we find that they struggle to cope with significant variations in viewpoint and scale, as well as scenes with sparse or limited information. Furthermore, these semantic-driven supersegment representations often exclude semantically meaningless yet valuable pixel information. In this work, we present Sel-V and MuSSel-V, two efficient variants within the segment-level VPR paradigm that replace heavy and fragmented supersegments with lightweight, visually compact and complete dilated superpixels for local feature aggregation. The use of superpixels preserves pixel-level details while reducing computational overhead. A multi-scale extension further enhances robustness to viewpoint and scale changes. Comprehensive experiments on twelve public benchmarks show that our approach achieves a better trade-off between accuracy and efficiency than existing segment-based methods. These results demonstrate that lightweight, non-semantic segmentation can serve as an effective alternative for high-performance, resource-efficient visual place recognition in robotics.

Index Terms: Localization, vision-based navigation, visual place recognition (VPR), superpixel, aggregation.
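The abstract describes retrieval at the segment level but does not spell out the matching rule. The sketch below is one plausible reading, offered as an assumption: each image is represented by a set of segment descriptors (which could be pooled from several superpixel scales), every query segment finds its nearest database segment by cosine similarity, and that similarity is credited to the image that owns the matched segment. The function `rank_places` and the toy data are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

def rank_places(query_segments, db_segments, db_image_ids, n_images):
    """Rank database images by summed best-match similarity of the query segments."""
    sims = l2_normalize(query_segments) @ l2_normalize(db_segments).T
    best = sims.argmax(axis=1)  # nearest database segment for each query segment
    scores = np.zeros(n_images)
    # np.add.at accumulates correctly when several segments vote for the same image.
    np.add.at(scores, db_image_ids[best], sims[np.arange(len(best)), best])
    return np.argsort(-scores)

# Assumed toy database: 3 images, 6 segment descriptors each, 32 dimensions.
rng = np.random.default_rng(0)
per_image = [rng.standard_normal((6, 32)) for _ in range(3)]
db_segments = np.concatenate(per_image)
db_image_ids = np.repeat(np.arange(3), 6)

# The query is a slightly perturbed view of image 1.
query_segments = per_image[1] + 0.05 * rng.standard_normal((6, 32))
ranking = rank_places(query_segments, db_segments, db_image_ids, n_images=3)
print(ranking[0])  # 1
```

Voting per segment rather than comparing one global vector per image is what lets a partial overlap between two views still produce a match, which is the motivation the abstract gives for segment-level representations.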

Index terms

Localization, Recognition, Vision-Based Navigation