ICRA 2026

LSV-Loc: LiDAR to Street View Image Crossmodal Localization

Sangmin Lee, Donghyun Choi, Jee-Hwan Ryu

AI summary

A single LiDAR scan matched with public StreetView panoramas enables accurate, map-free global localization for autonomous vehicles.

Cross-modal localization, LiDAR, StreetView, Global localization, Vision Transformer, Place recognition

Problem

Traditional global localization relies on costly high-definition maps or on GPS, which degrades in urban canyons. Existing cross-modal methods require pre-recorded reference data or route-specific calibration, which limits their scalability and generalization.

Approach

The framework projects LiDAR intensity scans and StreetView panoramas into a shared equirectangular space, aligns them using a weight-shared Vision Transformer, and refines heading and position via an equirectangular perspective-n-point solver.
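As a rough illustration of the first step, the sketch below projects a LiDAR point cloud with per-point intensity into an equirectangular image using a standard spherical projection. The function name `lidar_to_equirect`, the image size, and the axis convention (x forward, y left, z up, forward direction at the image center) are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def lidar_to_equirect(points, intensity, height=64, width=256):
    """Project LiDAR points (N, 3) with intensities (N,) to an equirectangular image.

    Assumed convention (hypothetical, not from the paper): x forward, y left,
    z up; azimuth 0 maps to the center column, elevation 0 to the center row.
    """
    points = np.asarray(points, dtype=np.float64)
    intensity = np.asarray(intensity, dtype=np.float32)

    rng = np.linalg.norm(points, axis=1)
    valid = rng > 1e-6  # drop points at the sensor origin
    points, intensity, rng = points[valid], intensity[valid], rng[valid]

    azimuth = np.arctan2(points[:, 1], points[:, 0])   # in [-pi, pi]
    elevation = np.arcsin(points[:, 2] / rng)          # in [-pi/2, pi/2]

    # Map angles to pixel coordinates; columns wrap around, rows are clipped.
    u = ((0.5 - azimuth / (2.0 * np.pi)) * width).astype(int) % width
    v = np.clip(((0.5 - elevation / np.pi) * height).astype(int), 0, height - 1)

    image = np.zeros((height, width), dtype=np.float32)
    # Write far points first so that, in practice, nearer points overwrite them.
    order = np.argsort(-rng)
    image[v[order], u[order]] = intensity[order]
    return image
```

A StreetView panorama is already stored in equirectangular form, so after this projection both modalities share the same angular pixel grid and can be fed to the same encoder.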

Key results

Weight-shared ViT encoder aligns LiDAR and StreetView features in a unified embedding space
Equirectangular PnP solver recovers accurate heading and planar translation from patch-level correspondences
Achieves high recall and heading accuracy for coarse 3-DoF localization using only a single LiDAR scan and public StreetView imagery
Eliminates dependency on HD maps, GPS initialization, and route-specific sensor calibration
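
To give some intuition for how heading can be read off equirectangular correspondences, the sketch below estimates a yaw offset from matched pixel columns. It is a deliberately simplified stand-in, not the paper's equirectangular PnP solver: it assumes the two views are roughly co-located (so the offset is a pure horizontal shift) and ignores translation entirely. The function name `yaw_from_column_matches` and its inputs are hypothetical.

```python
import numpy as np

def yaw_from_column_matches(u_query, u_ref, width):
    """Estimate the yaw offset (radians) that shifts query columns onto reference columns.

    In an equirectangular image of the given width, one column spans
    2*pi / width radians of azimuth, so a column difference is an angle.
    """
    u_query = np.asarray(u_query, dtype=np.float64)
    u_ref = np.asarray(u_ref, dtype=np.float64)
    delta = 2.0 * np.pi * (u_ref - u_query) / width

    # Circular mean: average on the unit circle so wrap-around at the
    # image border does not bias the estimate.
    return np.arctan2(np.mean(np.sin(delta)), np.mean(np.cos(delta)))
```

For example, with `width=256`, matches whose reference columns are all 32 pixels to the right of the query columns (modulo the image width) yield an offset of pi/4 radians. A full solver would additionally recover planar translation from the same correspondences.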

Why it matters

Enables scalable, cost-effective global localization for autonomous vehicles in GPS-denied urban environments using freely available public imagery without prior mapping.

Abstract

Accurate global localization remains a fundamental challenge in autonomous vehicle navigation. Traditional methods typically rely on high-definition (HD) maps generated through prior traverses or on auxiliary sensors such as a global positioning system (GPS). However, these approaches are often limited by high cost, poor scalability, and reduced reliability where GPS is unavailable. Moreover, prior methods require route-specific sensor calibration and impose modality-specific constraints, which restrict generalization across sensor types. The proposed framework addresses these limitations by leveraging a shared embedding space, learned via a weight-sharing Vision Transformer (ViT) encoder, that aligns two heterogeneous sensor modalities: Light Detection and Ranging (LiDAR) images and geo-tagged StreetView panoramas. This alignment enables reliable cross-modal retrieval and coarse localization without HD-map priors or route-specific calibration. Further, to address the heading inconsistency between the query LiDAR scan and the StreetView panorama, an equirectangular perspective-n-point (PnP) solver is proposed to refine the relative pose through patch-level feature correspondences. As a result, the framework achieves coarse 3-degree-of-freedom (DoF) localization from a single LiDAR scan and publicly available StreetView imagery, bridging the gap between place recognition and metric localization. Experiments demonstrate that the proposed method achieves high recall and heading accuracy, offering scalability in urban settings covered by public StreetView imagery without reliance on HD maps. Our code will be made publicly available at: https://github.com/iismn/IEEE RA-L LSV-Loc.
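
The cross-modal retrieval step mentioned in the abstract can be pictured as a nearest-neighbor search in the shared embedding space. The sketch below is a minimal version that assumes the embeddings have already been computed by some encoder and that cosine similarity is the ranking metric; the function name `retrieve_top_k` and the choice of metric are assumptions, not details confirmed by the paper.

```python
import numpy as np

def retrieve_top_k(query_emb, ref_embs, k=5):
    """Return indices and cosine similarities of the k references closest to the query.

    query_emb: (D,) embedding of the LiDAR image.
    ref_embs:  (M, D) embeddings of geo-tagged panoramas.
    """
    q = np.asarray(query_emb, dtype=np.float64)
    r = np.asarray(ref_embs, dtype=np.float64)

    # Normalize so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)

    sims = r @ q
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order, sims[order]
```

Under these assumptions, the geo-tag of the top-ranked panorama would serve as the coarse position estimate, which a pose refinement step could then improve.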

Index terms

Localization, Autonomous Vehicle Navigation, Range Sensing