ICRA 2026

VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization

Hannah Shafferman, Annika Thomas, Jouko Kinnari, Michael Ricard, Jose Nino, Jonathan How

AI summary

VISTA enables robust, real-time global localization across diverse viewpoints and seasonal changes using a lightweight, open-set monocular segmentation pipeline without domain-specific training.

global localization, monocular segmentation, view-invariant mapping, sparse object maps, geometric correspondence, UAV navigation

Problem

Global localization in unstructured environments fails under severe appearance changes such as viewpoint shifts, seasonal variation, and occlusion, especially with off-nadir monocular cameras. Existing methods either struggle under computational constraints or require domain-specific training.

Approach

VISTA uses an auto-segmentation object tracker to build sparse, uncertainty-aware 3D object maps from monocular video, then aligns frames via a geometric submap correspondence search that exploits spatial consistency between maps.
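
The summary above does not spell out how the correspondence search works, so the following is only a minimal sketch of the general idea of exploiting spatial consistency between two sparse object maps. It is not the authors' implementation. It assumes each map is an N×3 NumPy array of object centroids, treats two candidate associations as consistent when the inter-object distances agree within a tolerance, grows a consistent set greedily (a simple stand-in for a proper clique or graph search), and aligns the frames with the Kabsch method. The function names and the `tol` parameter are hypothetical.

```python
import numpy as np


def pairwise_dist(points):
    """Euclidean distance between every pair of rows in an (N, 3) array."""
    return np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)


def consistency_graph(map_a, map_b, tol):
    """Build all candidate associations (i in A, j in B) and mark which pairs
    of associations preserve inter-object distance within `tol`."""
    n_a, n_b = len(map_a), len(map_b)
    ia = np.repeat(np.arange(n_a), n_b)   # index into map_a per association
    ib = np.tile(np.arange(n_b), n_a)     # index into map_b per association
    d_a, d_b = pairwise_dist(map_a), pairwise_dist(map_b)
    diff = np.abs(d_a[ia[:, None], ia[None, :]] - d_b[ib[:, None], ib[None, :]])
    one_to_one = (ia[:, None] != ia[None, :]) & (ib[:, None] != ib[None, :])
    return ia, ib, (diff < tol) & one_to_one


def greedy_consistent_set(consistent):
    """Greedily grow a set of mutually consistent associations (heuristic)."""
    selected = [int(np.argmax(consistent.sum(axis=1)))]
    candidates = consistent[selected[0]].copy()
    while candidates.any():
        # Prefer the candidate that agrees with the most remaining candidates.
        scores = np.where(candidates, consistent[:, candidates].sum(axis=1), -1)
        nxt = int(np.argmax(scores))
        selected.append(nxt)
        candidates &= consistent[nxt]
    return selected


def kabsch(p, q):
    """Rigid transform (R, t) minimizing sum ||R p_i + t - q_i||^2."""
    cp, cq = p.mean(axis=0), q.mean(axis=0)
    h = (p - cp).T @ (q - cq)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return r, cq - r @ cp


def align_maps(map_a, map_b, tol=0.01):
    """Return matched (i, j) index pairs and the transform taking A into B."""
    ia, ib, consistent = consistency_graph(map_a, map_b, tol)
    sel = greedy_consistent_set(consistent)
    matches = sorted((int(ia[s]), int(ib[s])) for s in sel)
    if len(matches) < 3:
        raise ValueError("too few consistent associations to align frames")
    p = map_a[[i for i, _ in matches]]
    q = map_b[[j for _, j in matches]]
    r, t = kabsch(p, q)
    return matches, r, t
```

Calling `align_maps(map_a, map_b)` on two centroid arrays returns the matched index pairs together with a rotation `R` and translation `t` such that `R @ a + t` approximates the matching point in `map_b`. Because the check uses only distances between objects, it does not depend on object appearance or on the viewpoint from which each map was built. The all-pairs association matrix grows as (N_a · N_b)², so this sketch suits only small maps.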

Key results

Up to 69% improvement in localization recall over baselines
62× reduction in correspondence search computation time
Compact object maps at 0.03%–0.6% of baseline memory size
Zero-shot, view-invariant localization across seasonal and oblique datasets

Why it matters

It provides a computationally efficient, robust localization solution for multi-agent and UAV systems operating in GNSS-denied, unstructured environments with varying camera orientations.

Abstract

Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, occlusions, and perceptual aliasing in homogeneous environments, all of which are known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or fine-tuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% of the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.

Index terms

Localization; Mapping; Object Detection, Segmentation and Categorization