ICRA 2026

SAFL-Geo: Structure-Aware Feature Learning with Fusion Loss for Infrared-Visible Geo-Localization

Jiabo Shen, Shuying Zhao, Yunzhou Zhang, Tengda Zhang, Hongyu Zhou, Yu Zhang, Jiaxu Gao

AI summary

SAFL-Geo achieves state-of-the-art cross-modal geo-localization by leveraging structural features and a fusion loss bridge to overcome the infrared-visible modality gap.

Cross-modal geo-localization, Thermal infrared, Structure-aware learning, Fusion loss, UAV navigation, Contrastive learning

Problem

Cross-modal geo-localization between thermal infrared and visible satellite imagery suffers from a large modality gap and the loss of fine-grained structural details, which limits drone navigation in low-light conditions or adverse weather.

Approach

The proposed SAFL-Geo network extracts modality-invariant structural cues via dual attention, aggregates features into a unified embedding space, and bridges the modality gap using a cross-attention fusion module constrained by a novel fusion loss.
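
To make the cross-attention fusion step more concrete, here is a minimal sketch in plain Python. It is not the paper's module: there are no learned projections, and the function names (`softmax`, `cross_attention`) and the toy token values are assumptions made for illustration. It only shows how infrared tokens, used as queries, could attend over visible tokens to produce fused features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned weights.

    queries: tokens from one modality (assumed here: infrared)
    keys, values: tokens from the other modality (assumed here: visible)
    Returns one fused vector per query token.
    """
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        # Similarity of this query to every key token.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value tokens.
        dim = len(values[0])
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return fused


# Hypothetical toy tokens: one infrared query, two visible key/value tokens.
infrared_tokens = [[1.0, 1.0]]
visible_keys = [[1.0, 0.0], [1.0, 0.0]]
visible_values = [[0.0, 0.0], [2.0, 4.0]]
fused_tokens = cross_attention(infrared_tokens, visible_keys, visible_values)
```

Because the two keys in this toy input are identical, the attention weights are uniform and the fused token is the average of the two value tokens. A real module would normally add learned query, key, and value projections and multiple heads.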

Key results

80.55% R@1 and 83.13% AP on the Boson dataset, surpassing the previous state of the art by more than 8%
Structure-aware module extracts stable cross-modal contours and edges
Feature aggregation module projects multi-modal features into a unified embedding space
Fusion loss strategy bridges modalities via soft intermediate features
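
For readers unfamiliar with the R@1 and AP metrics quoted above, the following sketch shows one common way to compute them for a retrieval task. The exact evaluation protocol is not stated on this page, so this uses the standard non-interpolated average precision as an assumption. The function names and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_relevance, k=1):
    """Fraction of queries with at least one true match in the top k.

    ranked_relevance: one list of booleans per query, ordered by
    descending similarity to that query (True marks a correct match).
    """
    hits = sum(1 for ranking in ranked_relevance if any(ranking[:k]))
    return hits / len(ranked_relevance)


def average_precision(ranking):
    """Non-interpolated average precision for a single ranked list."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranking, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_relevance):
    """Average precision averaged over all queries."""
    total = sum(average_precision(r) for r in ranked_relevance)
    return total / len(ranked_relevance)


# Hypothetical toy rankings for three infrared queries.
rankings = [
    [True, False, False],         # match at rank 1
    [False, True, False],         # match at rank 2
    [False, False, True, True],   # matches at ranks 3 and 4
]
r_at_1 = recall_at_k(rankings, k=1)
mean_ap = mean_average_precision(rankings)
```

On these toy rankings only the first query has a correct top-1 result, so R@1 is 1/3, and the mean AP is about 0.639. Some benchmarks use an interpolated variant of AP, which would give slightly different numbers.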

Why it matters

Enables reliable UAV localization and navigation in low-visibility conditions, advancing all-weather autonomous drone deployment.

Abstract

Cross-modal Visual Geo-localization often aims to retrieve a satellite visible-light image of the same geographic location from a large-scale database using an infrared image captured by an unmanned aerial vehicle (UAV), thereby achieving precise localization. This capability is crucial for autonomous drone localization and navigation in low-light conditions such as nighttime or smoky environments. However, research in this field is still in its nascent stage, with existing methods being few in number and limited in precision. To address these issues, this paper proposes a structure-aware and fusion-loss constrained cross-modal geo-localization network (SAFL-Geo), which enhances the accuracy of cross-modal image retrieval. Specifically, we design a structure-aware module embedded into the network backbone, substantially enhancing the model’s ability to perceive and extract cross-modally consistent structural features (such as road and building contours). Furthermore, we propose a feature enhancement and aggregation module that projects the refined multi-modal representations into a unified embedding space, effectively reducing the cross-modal representation gap while preserving discriminative semantic structures. Finally, we propose a fusion loss constraint strategy that constructs intermediate fused features as a “bridge” to constrain the distribution distances between infrared and fused features, as well as between visible and fused features, thereby indirectly mitigating the modality gap. Extensive experiments on the Boson dataset show that our SAFL-Geo achieves state-of-the-art performance.
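
The "bridge" idea behind the fusion loss can be illustrated with a small sketch. This is not the paper's formulation: the fused features here are a simple weighted average (standing in for the cross-attention fusion module), and the distribution distance is the squared distance between batch means, chosen only because it is easy to follow. All function names and toy values are assumptions.

```python
def batch_mean(features):
    """Per-dimension mean of a batch of feature vectors."""
    count = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / count for d in range(dim)]


def mean_discrepancy(batch_a, batch_b):
    """Squared Euclidean distance between two batch means.

    A deliberately simple stand-in for a distribution distance.
    """
    mean_a, mean_b = batch_mean(batch_a), batch_mean(batch_b)
    return sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))


def fuse(infrared, visible, alpha=0.5):
    """Assumed fusion: element-wise weighted average of paired features."""
    return [
        [alpha * x + (1.0 - alpha) * y for x, y in zip(f_ir, f_vis)]
        for f_ir, f_vis in zip(infrared, visible)
    ]


def fusion_loss(infrared, visible, alpha=0.5):
    """Pull each modality toward the fused 'bridge' features."""
    fused = fuse(infrared, visible, alpha)
    return mean_discrepancy(infrared, fused) + mean_discrepancy(visible, fused)


# Hypothetical toy embeddings for two infrared/visible pairs.
infrared_features = [[0.0, 0.0], [2.0, 0.0]]
visible_features = [[0.0, 2.0], [2.0, 2.0]]
loss = fusion_loss(infrared_features, visible_features)
direct_gap = mean_discrepancy(infrared_features, visible_features)
```

In this toy case the two bridge terms sum to 2.0, while the direct infrared-to-visible discrepancy is 4.0. Each modality is asked to move only halfway toward an intermediate target, which is the intuition behind constraining the gap indirectly through fused features.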

Index terms

Localization, Deep Learning for Visual Perception