Cross-Distill: Multi-Manifold and Viewpoint-Decoupled Distillation for Cross-View Geo-Localization
Jiaxu Gao, Shuying Zhao, Yunzhou Zhang, Hongyu Zhou, Man Qi, Jiabo Shen, Yu Zhang
AI summary
Problem
Severe viewpoint variation in cross-view geo-localization has pushed accurate methods toward heavy backbones and add-on modules, which makes real-time deployment on resource-constrained UAVs impractical.
Approach
Cross-Distill transfers knowledge from a heavy teacher to a lightweight student by decoupling ranking relations across viewpoints and aligning features across spherical, Euclidean, and hyperbolic manifolds.
Key results
- UAV-to-SAT recall@1 improves from 75.97% to 94.43% on University-1652
- Average precision improves from 79.24% to 95.33% on the same task
- Maintains low complexity with 31.43M parameters and 13.09 GFLOPs
- Delivers fast inference at 62.02 ms per image on an RK3588 chip
Why it matters
Enables accurate, real-time geo-localization on lightweight UAVs, bridging the gap between high-performance retrieval models and practical aerial robotics deployment.
Abstract
Cross-View Geo-Localization (CVGL) localizes a query image via retrieval from georeferenced satellite imagery, yet severe viewpoint variation remains a central challenge. Recent advances often rely on heavy backbones or add-on modules that achieve high accuracy but are impractical on resource-constrained UAVs. To balance accuracy and efficiency, we introduce Cross-Distill, a knowledge-distillation framework for CVGL. Cross-Distill performs Cross-Similarity Ranking Distillation by constructing a teacher–student interaction matrix to enforce ranking consistency and enhance discrimination. Building on this, it introduces Viewpoint Decoupling, which partitions ranking relations into intra-view, intra-to-cross-view, and cross-to-cross-view, enabling precise modeling of cross-view dependencies and improving class compactness and separability. Cross-Distill further employs Multi-Manifold Feature Distillation that jointly enforces angular consistency on the spherical manifold, preserves local distances in Euclidean space, and leverages hyperbolic distance as a negatively curved metric to strengthen teacher–student alignment. Experiments on University-1652 and SUES-200 show that the distilled student achieves significant gains with low complexity (31.43M parameters, 13.09 GFLOPs), and an inference time of only 62.02 ms per image on an RK3588. For instance, on University-1652 UAV→SAT retrieval, R@1 improves from 75.97% to 94.43% and AP from 79.24% to 95.33%.
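To make the three geometries named in the abstract concrete, the following is a minimal sketch of how a teacher and a student feature vector could be compared on each manifold. It is an illustration only, not the authors' loss: the abstract does not give the exact formulation, so the function names, the use of the Poincaré ball model with an origin exponential map, and the equal weighting in `multi_manifold_distance` are all assumptions made here.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_distance(a, b):
    """Spherical view: 1 - cosine similarity of the L2-normalized vectors."""
    na, nb = _norm(a), _norm(b)
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)

def euclidean_distance(a, b):
    """Euclidean view: ordinary L2 distance between the raw vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def to_poincare_ball(v, eps=1e-6):
    """Assumed mapping: exponential map at the origin of the unit Poincare ball.

    The result has norm tanh(||v||), clipped so it stays strictly inside the ball.
    """
    n = _norm(v)
    if n == 0.0:
        return list(v)
    r = min(math.tanh(n), 1.0 - eps)
    return [x * r / n for x in v]

def poincare_distance(a, b):
    """Hyperbolic view: geodesic distance in the Poincare ball (curvature -1)."""
    u, w = to_poincare_ball(a), to_poincare_ball(b)
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, w))
    denom = (1.0 - sum(x * x for x in u)) * (1.0 - sum(x * x for x in w))
    return math.acosh(1.0 + 2.0 * sq_diff / denom)

def multi_manifold_distance(teacher, student, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined objective: weighted sum of the three distances."""
    ws, we, wh = weights
    return (ws * cosine_distance(teacher, student)
            + we * euclidean_distance(teacher, student)
            + wh * poincare_distance(teacher, student))

if __name__ == "__main__":
    teacher_feat = [0.2, -0.5, 0.9]   # toy values, not from the paper
    student_feat = [0.1, -0.4, 1.0]
    print(multi_manifold_distance(teacher_feat, student_feat))
```

In this sketch each term captures a different notion of mismatch: the cosine term ignores feature magnitude, the Euclidean term is sensitive to it, and the hyperbolic term grows quickly as points approach the boundary of the ball. In a training setup these quantities would be computed on batches with an autodiff framework rather than on Python lists.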