ICRA 2026

Sat-RoMa: Cross-Scale Dense Matching for Multi-Temporal UAV-To-Orthophoto Registration

Maciej Krupka, Jan Węgrzynowski, Piotr Skrzypczynski

AI summary

Sat-RoMa achieves an 11.2% scale error in drone-to-satellite matching, versus over 100% for baselines such as LoFTR and LightGlue, a step toward reliable GPS-denied UAV navigation.

GPS-denied navigation, UAV localization, cross-scale matching, satellite imagery, dense feature matching, temporal robustness

Problem

Current feature matchers fail to accurately localize downward-facing drone cameras against reference satellite maps due to severe seasonal appearance changes and extreme scale discrepancies. This gap prevents reliable absolute positioning for UAVs in GPS-denied environments.

Approach

Sat-RoMa adapts the RoMa architecture by freezing a satellite-pretrained DinoV3 encoder and training end-to-end on cross-seasonal image pairs. It explicitly handles matching a small drone query to a 4× larger reference map.

Key results

Achieves 11.2% scale error versus over 100% for baselines
Reduces reprojection error to 42.3 pixels, a 6–7× improvement
Lowers rotational error to 11.1° while maintaining geometric fidelity
Demonstrates robustness to severe seasonal and structural appearance changes
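
The summary does not state how the scale, rotation, and reprojection errors are computed. As a rough illustration only, the sketch below shows one common way such metrics could be derived from an estimated and a ground-truth homography. Every function name here is hypothetical, and the definitions (scale and rotation read from the upper-left 2×2 block, reprojection error averaged over the four image corners) are assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def similarity(scale, angle_deg, tx, ty):
    """Build a 3x3 similarity transform (hypothetical helper for the example)."""
    a = np.radians(angle_deg)
    c, s = scale * np.cos(a), scale * np.sin(a)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

def apply_homography(H, pts):
    """Map an (N, 2) array of points through a 3x3 homography."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def corner_reprojection_error(H_est, H_gt, width, height):
    """Mean pixel distance between the image corners warped by H_est and H_gt."""
    corners = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=float)
    diff = apply_homography(H_est, corners) - apply_homography(H_gt, corners)
    return float(np.linalg.norm(diff, axis=1).mean())

def scale_and_rotation(H):
    """Approximate scale and rotation (degrees) from the upper-left 2x2 block.

    Assumption: H is close to a similarity transform, so perspective terms
    are small enough to ignore.
    """
    A = H[:2, :2] / H[2, 2]
    scale = np.sqrt(abs(np.linalg.det(A)))
    angle = np.degrees(np.arctan2(A[1, 0], A[0, 0]))
    return scale, angle

def scale_error_percent(H_est, H_gt):
    """Relative scale error in percent (assumed definition)."""
    s_est, _ = scale_and_rotation(H_est)
    s_gt, _ = scale_and_rotation(H_gt)
    return 100.0 * abs(s_est - s_gt) / s_gt

def rotation_error_deg(H_est, H_gt):
    """Absolute rotation difference in degrees, wrapped to [0, 180]."""
    _, a_est = scale_and_rotation(H_est)
    _, a_gt = scale_and_rotation(H_gt)
    return abs((a_est - a_gt + 180.0) % 360.0 - 180.0)
```

Under these assumed definitions, a scale error above 100% means the estimated scale is off by more than the true scale itself, which helps put the baseline figures in context.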

Why it matters

Provides a critical drift-correction mechanism for UAVs operating in GPS-denied environments, advancing autonomous search and rescue and mapping applications.

Abstract

Reliable Global Navigation Satellite System (GNSS) signals are increasingly denied or jammed in real-world applications, such as search and rescue operations. In such scenarios, Unmanned Aerial Vehicles (UAVs) must rely on downward-facing cameras for absolute localization against reference satellite maps. While Visual Inertial Odometry (VIO) is highly accurate locally, it inevitably accumulates drift over time. Localizing a drone image against a pre-existing satellite map (e.g., Google Earth) via homography estimation is a viable solution, but it is severely challenged by seasonal variations, construction, and vegetation changes. In this paper, we propose Sat-RoMa, an end-to-end robust dense feature matcher adapted from the state-of-the-art RoMa architecture. By utilizing a frozen, pre-trained DinoV3 encoder specifically tuned for satellite imagery, and formulating the task as matching a small drone image to a 4× larger reference map, Sat-RoMa explicitly handles scale discrepancies and temporal appearance changes. Preliminary results demonstrate that Sat-RoMa significantly outperforms baselines like LoFTR and LightGlue, achieving an 11.2% scale error compared to over 100% for existing methods, paving the way for robust GPS-denied UAV navigation.
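
To make the homography estimation step mentioned in the abstract concrete, here is a minimal sketch of fitting a homography to point correspondences with the normalized Direct Linear Transform (DLT). It is not the authors' pipeline: the function names are hypothetical, the correspondences are assumed to come from some dense matcher, and the sketch omits the robust estimation (for example RANSAC) that a practical system would typically add to reject outlier matches.

```python
import numpy as np

def project(H, pts):
    """Map an (N, 2) array of points through a 3x3 homography."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def normalize_points(pts):
    """Hartley normalization: zero mean and mean distance sqrt(2) from the origin."""
    mean = pts.mean(axis=0)
    s = np.sqrt(2.0) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[s, 0.0, -s * mean[0]],
                  [0.0, s, -s * mean[1]],
                  [0.0, 0.0, 1.0]])
    return (pts - mean) * s, T

def fit_homography_dlt(src, dst):
    """Least-squares homography mapping src to dst, both (N, 2) with N >= 4.

    Assumes outlier-free matches and that the bottom-right entry of the
    resulting homography is not close to zero.
    """
    src_n, T_src = normalize_points(src)
    dst_n, T_dst = normalize_points(dst)
    rows = []
    for (x, y), (u, v) in zip(src_n, dst_n):
        # Two linear constraints per correspondence on the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    H_n = Vt[-1].reshape(3, 3)
    # Undo the normalization on both sides.
    H = np.linalg.inv(T_dst) @ H_n @ T_src
    return H / H[2, 2]
```

In a drone-to-map setting, `src` would hold pixel coordinates in the drone image and `dst` the matched coordinates in the reference map crop; the fitted homography then places the drone footprint on the map.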

Index terms

Localization; Aerial Systems: Perception and Autonomy; Deep Learning for Visual Perception