ICRA 2026

Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

Jiuhong Xiao, Yang Zhou, Giuseppe Loianno

AI summary

QAA enables robust, universal visual place recognition across diverse datasets by using learned queries as reference codebooks to compute cross-query similarity, outperforming state-of-the-art models with minimal computational and parameter overhead.

Visual Place Recognition, Multi-Dataset Training, Feature Aggregation, Cross-Query Similarity, Learned Queries, Universal VPR

Problem

Single-dataset VPR models suffer from dataset-specific biases and poor generalization, while multi-dataset joint training often underperforms due to limited information capacity in feature aggregation layers when handling divergent data.

Approach

The method employs learned queries as independent reference codebooks and computes a cross-query similarity matrix between them and query-level image features, generating robust descriptors without significant computational or parameter overhead.
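To make the aggregation step concrete, the following is a minimal NumPy sketch of one plausible reading of the approach, not the authors' implementation. It assumes that one set of learned queries pools the local image features through softmax attention, that a second learned array serves as the reference codebook, and that the similarity is contracted over the query axis so the descriptor size does not depend on the number of queries. All names, shapes, and the projection matrix `proj` are hypothetical, and random values stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_query_descriptor(feats, feat_queries, proj, ref_codebook):
    """Hypothetical sketch of query-based aggregation.

    feats:        (N, D)  local image features from a backbone
    feat_queries: (M, D)  queries that pool the image features (assumed learned)
    proj:         (D, Cf) projection of pooled features (assumption)
    ref_codebook: (M, Cr) reference codebook, one row per query (assumed learned)
    """
    d = feats.shape[1]
    # Each query attends over all N local features.
    attn = softmax(feat_queries @ feats.T / np.sqrt(d), axis=1)  # (M, N)
    pooled = attn @ feats @ proj                                 # (M, Cf)
    # Normalize each channel along the query axis so the product below
    # is a cosine-style similarity between channels.
    pooled = l2_normalize(pooled, axis=0)
    ref = l2_normalize(ref_codebook, axis=0)
    # Contract over the query axis: the result is (Cf, Cr) for any M.
    cs = pooled.T @ ref
    return l2_normalize(cs.reshape(-1))

def demo(num_queries, seed=0):
    # Random stand-ins for backbone features and learned parameters.
    rng = np.random.default_rng(seed)
    n, d, cf, cr = 196, 64, 16, 8
    feats = rng.standard_normal((n, d))
    feat_queries = rng.standard_normal((num_queries, d))
    proj = rng.standard_normal((d, cf))
    ref_codebook = rng.standard_normal((num_queries, cr))
    return cross_query_descriptor(feats, feat_queries, proj, ref_codebook)

if __name__ == "__main__":
    for m in (8, 32):
        print(m, demo(m).shape)
```

Under these assumptions, changing the number of queries changes only the size of the attention map and of the codebook; the flattened descriptor keeps length Cf × Cr, which is one way the query count could scale without enlarging the output.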

Key results

Outperforms state-of-the-art VPR models across multi-view and front-view benchmarks
Achieves peak performance comparable to dataset-specific models using a smaller descriptor dimension
Introduces Cross-query Similarity aggregation that preserves higher information capacity than score-based methods
Demonstrates scalable query usage with minimal computational and parameter overhead

Why it matters

Enables robotics and computer vision systems to achieve robust, universal place recognition across diverse real-world environments without sacrificing efficiency.

Abstract

Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA’s mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: xjh19971.github.io/QAA.

Index terms

Deep Learning for Visual Perception