VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization
Hannah Shafferman, Annika Thomas, Jouko Kinnari, Michael Ricard, Jose Nino, Jonathan How
AI summary
Problem
Global localization in unstructured environments often fails under severe appearance changes such as viewpoint shifts, seasonal variations, and occlusions, especially with off-nadir monocular cameras. Existing methods either struggle with computational constraints or require domain-specific training.
Approach
VISTA uses an auto-segmentation object tracker to build sparse, uncertainty-aware 3D object maps from monocular video, then aligns frames via a geometric submap correspondence search that exploits spatial consistency between maps.
Key results
- Up to 69% improvement in localization recall over baselines
- 62× reduction in correspondence search computation time
- Compact object maps at 0.03%–0.6% of baseline map memory size
- Zero-shot, view-invariant localization across seasonal and oblique datasets
Why it matters
It provides a computationally efficient, robust localization solution for multi-agent and UAV systems operating in GNSS-denied, unstructured environments with varying camera orientations.
Abstract
Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, occlusions, and perceptual aliasing in homogeneous environments, all known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or fine-tuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.
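To make the idea of a correspondence search that exploits geometric consistency more concrete, the following is a minimal illustrative sketch, not VISTA's actual implementation. It assumes each map is reduced to a set of 3D object centroids, treats two putative object associations as mutually consistent when they preserve the inter-object distance, selects a consistent set with a simple greedy heuristic (a stand-in for whatever solver the paper uses), and then aligns the two reference frames with the standard Kabsch method. The function names, the tolerance `eps`, and the synthetic scene are all hypothetical.

```python
import numpy as np

def consistent_associations(a, b, pairs, eps=0.1):
    """Greedily pick putative associations (i, j) that are pairwise distance-consistent."""
    ia, ib = np.asarray(pairs).T
    # Inter-object distances within each map, for every pair of associations.
    da = np.linalg.norm(a[ia][:, None] - a[ia][None], axis=-1)
    db = np.linalg.norm(b[ib][:, None] - b[ib][None], axis=-1)
    ok = np.abs(da - db) < eps
    # One object may not be matched to two different objects.
    ok &= ~((ia[:, None] == ia[None]) ^ (ib[:, None] == ib[None]))
    # Greedy heuristic: visit associations by decreasing number of agreements.
    chosen = []
    for k in np.argsort(-ok.sum(axis=1)):
        if all(ok[k, c] for c in chosen):
            chosen.append(k)
    return [pairs[k] for k in chosen]

def kabsch(P, Q):
    """Rigid transform (R, t) minimizing sum ||R p + t - q||^2 over matched rows."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

# Synthetic example: map b is map a seen from another frame, shuffled, plus clutter.
rng = np.random.default_rng(0)
a = rng.uniform(-50, 50, size=(8, 3))
theta = np.deg2rad(40.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([12.0, -7.0, 3.0])
perm = rng.permutation(8)
b = np.vstack([(a @ R_true.T + t_true)[perm],
               rng.uniform(-50, 50, size=(3, 3))])  # 3 unmatched objects

# No descriptors are assumed here, so every object pair is a putative association.
pairs = [(i, j) for i in range(len(a)) for j in range(len(b))]
matches = consistent_associations(a, b, pairs)
R_est, t_est = kabsch(a[[i for i, _ in matches]], b[[j for _, j in matches]])
```

Because the search relies only on relative object geometry, it does not depend on how the objects looked when they were observed, which is the intuition behind view-invariant and season-invariant alignment. A real system would also need to handle noisy centroids and per-object uncertainty, which this sketch ignores.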