Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition
Jiuhong Xiao, Yang Zhou, Giuseppe Loianno
AI summary
Problem
Single-dataset VPR models suffer from dataset-specific biases and poor generalization, while multi-dataset joint training often underperforms due to limited information capacity in feature aggregation layers when handling divergent data.
Approach
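The method, Query-based Adaptive Aggregation (QAA), uses learned queries as independent reference codebooks and computes a Cross-query Similarity (CS) matrix between query-level image features and those codebooks, generating robust descriptors without increasing output dimensionality and without significant computational or parameter overhead.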
Key results
- Outperforms state-of-the-art VPR models across multi-view and front-view benchmarks
- Achieves peak performance comparable to dataset-specific models using a smaller descriptor dimension
- Introduces Cross-query Similarity aggregation that preserves higher information capacity than score-based methods
- Demonstrates scalable query usage with minimal computational and parameter overhead
Why it matters
Enables robotics and computer vision systems to achieve robust, universal place recognition across diverse real-world environments without sacrificing efficiency.
Abstract
Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA’s mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: xjh19971.github.io/QAA.
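To make the idea of Cross-query Similarity more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that query-level image features are obtained by softmax attention of learned queries over patch features, and that the descriptor is the flattened, L2-normalized similarity matrix between those features and a learned codebook. The function name, argument names, attention form, and normalization steps are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_query_similarity_descriptor(patch_feats, feat_queries, codebook):
    """Hypothetical sketch of a cross-query similarity descriptor.

    patch_feats:  (N, C) backbone features for one image (assumed input).
    feat_queries: (M, C) learned queries that attend to the image (assumed).
    codebook:     (K, C) learned, image-independent reference codebook (assumed).
    Returns a unit-norm descriptor of length M * K.
    """
    c = patch_feats.shape[1]

    # Assumed step: each learned query pools the patch features via attention.
    attn = softmax(feat_queries @ patch_feats.T / np.sqrt(c), axis=-1)  # (M, N)
    query_feats = attn @ patch_feats                                    # (M, C)

    # Assumed step: compare directions only, so normalize both sides.
    eps = 1e-12
    query_feats = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)
    codebook = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + eps)

    # Similarity between every query-level feature and every codebook entry.
    sim = query_feats @ codebook.T                                      # (M, K)

    # Flatten and L2-normalize to obtain the global descriptor.
    desc = sim.reshape(-1)
    return desc / (np.linalg.norm(desc) + eps)


rng = np.random.default_rng(0)
desc = cross_query_similarity_descriptor(
    rng.standard_normal((196, 64)),  # 196 patches, 64 channels
    rng.standard_normal((8, 64)),    # 8 feature queries
    rng.standard_normal((16, 64)),   # 16 codebook entries
)
print(desc.shape)
```

In this sketch the descriptor length is M × K, so it depends only on the number of queries and codebook entries, not on the channel width C. That is one way a similarity-based aggregation could keep the output dimension fixed while the compared features stay wide.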