ICRA 2026

AWENet: A Self-Supervised Network for Efficient Interest Point Detection and Description

Pengwei Jia, Kang Li, Siren Batu

AI summary

AWENet achieves competitive matching accuracy and the lowest localization error while drastically reducing computational cost through a lightweight, self-supervised architecture.

interest point detection, local feature description, self-supervised learning, wavelet downsampling, multi-scale attention, computational efficiency

Problem

Existing end-to-end interest point detection and description networks struggle with high computational overhead, limited descriptor discriminability in self-supervised settings, and suboptimal localization accuracy.

Approach

The proposed network combines accelerated convolutional expansion, wavelet-based downsampling to preserve multi-frequency details, and multi-scale attention, all guided by multi-objective knowledge distillation from a teacher model.
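
To illustrate why wavelet-based downsampling can preserve multi-frequency detail, here is a minimal NumPy sketch of a single-level 2D Haar transform used as a downsampling step. It is an assumption-laden illustration, not the paper's layer: the summary does not say which wavelet AWENet uses, and the names `haar_downsample` and `haar_upsample` are hypothetical. Each 2x2 block is mapped to one low-frequency value and three detail values, so resolution halves while the information moves into channels rather than being discarded.

```python
import numpy as np

def haar_downsample(x):
    """Single-level 2D Haar transform used as a downsampling step.

    x: array of shape (C, H, W) with even H and W.
    Returns an array of shape (4*C, H/2, W/2): the low-frequency band
    followed by three detail bands. The Haar wavelet is an assumption here.
    """
    a = x[:, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # block average (scaled): low frequencies
    lh = (a - b + c - d) / 2.0  # left column minus right column
    hl = (a + b - c - d) / 2.0  # top row minus bottom row
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return np.concatenate([ll, lh, hl, hh], axis=0)

def haar_upsample(y):
    """Exact inverse of haar_downsample (the transform is orthonormal)."""
    ll, lh, hl, hh = np.split(y, 4, axis=0)
    channels, h, w = ll.shape
    x = np.empty((channels, 2 * h, 2 * w), dtype=y.dtype)
    x[:, 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[:, 0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[:, 1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[:, 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

if __name__ == "__main__":
    feature_map = np.random.rand(8, 32, 32)
    bands = haar_downsample(feature_map)
    print(bands.shape)
    print(np.allclose(haar_upsample(bands), feature_map))
```

Unlike strided convolution or max pooling, this step is invertible, which is one plausible reading of "preserving multi-frequency details"; a real network would typically follow it with a learned convolution that mixes the four bands.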

Key results

Lowest localization error on HPatches
Top homography estimation accuracy at 3 and 5 pixel thresholds
Highest mean matching accuracy under illumination changes
Significantly improved processing speed with competitive matching scores
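
For readers unfamiliar with the homography estimation metric, the sketch below shows one common way it is computed on HPatches-style image pairs: warp the four image corners with the estimated and the ground-truth homography, average the corner distances, and count a pair as correct when that error is within a pixel threshold. This summary does not spell out the paper's exact protocol, so the corner-error definition, the `<=` comparison, and the function names are assumptions; only the 3 and 5 pixel thresholds come from the results above.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of (x, y) points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # back to Cartesian

def mean_corner_error(H_est, H_gt, width, height):
    """Mean distance in pixels between the image corners warped by each homography."""
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype=float,
    )
    diff = warp_points(H_est, corners) - warp_points(H_gt, corners)
    return float(np.linalg.norm(diff, axis=1).mean())

def homography_accuracy(errors, thresholds=(3.0, 5.0)):
    """Fraction of image pairs whose corner error is within each threshold."""
    errors = np.asarray(errors, dtype=float)
    return {t: float((errors <= t).mean()) for t in thresholds}

if __name__ == "__main__":
    H_gt = np.eye(3)
    # Hypothetical estimate that is off by a small translation.
    H_est = np.array([[1.0, 0.0, 1.5], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
    err = mean_corner_error(H_est, H_gt, width=640, height=480)
    print(err, homography_accuracy([err]))
```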

Why it matters

Enables efficient, high-precision local feature extraction for real-time computer vision tasks like SLAM and visual localization on resource-constrained hardware.

Abstract

We introduce AWENet (Attention-guided Wavelet Enhancement Network), an efficient self-supervised network for joint interest point detection and description that balances computational speed with feature accuracy. The network preserves fine structural details while employing multi-scale attention to enhance the discriminability of descriptors, leading to more precise and reliable interest point correspondences. Evaluations on the HPatches dataset demonstrate that AWENet achieves competitive performance in repeatability, localization accuracy, and matching robustness. Its lightweight design ensures fast processing and low computational cost, making it well-suited for applications where efficiency is critical. Qualitative results show that the network generates dense and accurate correspondences under diverse transformations, including changes in viewpoint and illumination. Overall, AWENet provides a practical and effective solution for learning local features, achieving strong matching performance without relying on heavy computation.

Index terms

Deep Learning for Visual Perception, Deep Learning Methods, Audio-Visual SLAM