LSV-Loc: LiDAR to Street View Image Crossmodal Localization
Sangmin Lee, Donghyun Choi, Jee-Hwan Ryu
AI summary
Problem
Traditional global localization relies on costly high-definition maps or GPS, which degrades in urban canyons, while existing cross-modal methods require pre-recorded reference data or route-specific calibration, limiting scalability and generalization.
Approach
The framework projects LiDAR intensity scans and StreetView panoramas into a shared equirectangular space, aligns them using a weight-shared Vision Transformer, and refines heading and position via an equirectangular perspective-n-point solver.
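The projection step can be illustrated with a minimal sketch that rasterizes a LiDAR scan into an equirectangular intensity image. The function name `lidar_to_equirect`, the image size, the axis convention (x forward, y left, z up), and the nearest-return rule are all assumptions made for illustration; the paper's actual projection parameters are not given in this summary.

```python
import numpy as np

def lidar_to_equirect(points, intensity, height=64, width=256):
    """Rasterize LiDAR returns into an equirectangular intensity image.

    points: (N, 3) array in the sensor frame (assumed x forward, y left, z up).
    intensity: (N,) array of per-point intensity values.
    The image centre column corresponds to the forward direction (assumption).
    """
    points = np.asarray(points, dtype=np.float64)
    intensity = np.asarray(intensity, dtype=np.float32)

    rng = np.linalg.norm(points, axis=1)
    valid = rng > 1e-6                      # drop returns at the sensor origin
    points, intensity, rng = points[valid], intensity[valid], rng[valid]

    azimuth = np.arctan2(points[:, 1], points[:, 0])   # [-pi, pi], positive to the left
    elevation = np.arcsin(points[:, 2] / rng)          # [-pi/2, pi/2], positive up

    # Longitude maps to columns, latitude maps to rows.
    u = ((0.5 - azimuth / (2.0 * np.pi)) * width).astype(int) % width
    v = np.clip(((0.5 - elevation / np.pi) * height).astype(int), 0, height - 1)

    # Write far points first so the nearest return in each pixel is kept.
    order = np.argsort(-rng)
    image = np.zeros((height, width), dtype=np.float32)
    image[v[order], u[order]] = intensity[order]
    return image
```

Because a StreetView panorama is already equirectangular, an image produced this way shares its angular layout, so a heading offset between the two appears as a horizontal circular shift.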
Key results
- Weight-shared ViT encoder aligns LiDAR and StreetView features in a unified embedding space
- Equirectangular PnP solver recovers accurate heading and planar translation from patch-level correspondences
- Achieves high recall and heading accuracy for coarse 3-DoF localization using only a single LiDAR scan and public StreetView imagery
- Eliminates dependency on HD maps, GPS initialization, and route-specific sensor calibration
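To make the pose-refinement idea concrete, the sketch below recovers a heading and a planar translation from correspondences between LiDAR points and panorama bearings. It is a simplified, azimuth-only stand-in and not the authors' equirectangular PnP solver: the function name `solve_planar_pose`, the grid search over heading, and the assumption that correspondences are already given and outlier-free are all hypothetical choices for illustration.

```python
import numpy as np

def solve_planar_pose(points_xy, azimuths, num_headings=720):
    """Estimate (yaw, t) such that R(yaw) @ p_i + t points along azimuth_i.

    points_xy: (N, 2) LiDAR points in the LiDAR frame (ground-plane coordinates).
    azimuths:  (N,) bearings in radians of the matched panorama patches,
               expressed in the panorama frame.
    Illustrative sketch: grid search over yaw, linear least squares for t.
    """
    points_xy = np.asarray(points_xy, dtype=np.float64)
    azimuths = np.asarray(azimuths, dtype=np.float64)

    directions = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)
    normals = np.stack([-np.sin(azimuths), np.cos(azimuths)], axis=1)

    best_cost, best_yaw, best_t = np.inf, 0.0, np.zeros(2)
    for yaw in np.linspace(-np.pi, np.pi, num_headings, endpoint=False):
        c, s = np.cos(yaw), np.sin(yaw)
        rotation = np.array([[c, -s], [s, c]])
        rotated = points_xy @ rotation.T

        # Each correspondence gives one linear constraint on t:
        # normal_i . (rotated_i + t) = 0
        rhs = -np.sum(normals * rotated, axis=1)
        t, *_ = np.linalg.lstsq(normals, rhs, rcond=None)

        # Reject headings that place any point behind its bearing ray.
        transformed = rotated + t
        if np.any(np.sum(directions * transformed, axis=1) <= 0.0):
            continue

        cost = float(np.sum((normals @ t - rhs) ** 2))
        if cost < best_cost:
            best_cost, best_yaw, best_t = cost, yaw, t
    return best_yaw, best_t
```

With the default 720 candidates the heading resolution is 0.5 degrees. A practical pipeline would presumably add robust estimation (for example RANSAC) and a continuous refinement step, neither of which is shown here.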
Why it matters
Enables scalable, cost-effective global localization for autonomous vehicles in GPS-denied urban environments using freely available public imagery without prior mapping.
Abstract
Accurate global localization remains a fundamental challenge in autonomous vehicle navigation. Traditional methods typically rely on high-definition (HD) maps generated through prior traverses or on auxiliary sensors such as a global positioning system (GPS). However, these approaches are often limited by high costs, scalability issues, and decreased reliability where GPS is unavailable. Moreover, prior methods require route-specific sensor calibration and impose modality-specific constraints, which restrict generalization across different sensor types. The proposed framework addresses these limitations by leveraging a shared embedding space, learned via a weight-sharing Vision Transformer (ViT) encoder, that aligns two heterogeneous sensor modalities: Light Detection and Ranging (LiDAR) images and geo-tagged StreetView panoramas. This alignment enables reliable cross-modal retrieval and coarse-level localization without HD-map priors or route-specific calibration. Further, to address the heading inconsistency between the query LiDAR scan and the retrieved StreetView panorama, an equirectangular perspective-n-point (PnP) solver is proposed to refine the relative pose through patch-level feature correspondences. As a result, the framework achieves coarse 3-degree-of-freedom (DoF) localization from a single LiDAR scan and publicly available StreetView imagery, bridging the gap between place recognition and metric localization. Experiments demonstrate that the proposed method achieves high recall and heading accuracy, offering scalability in urban settings covered by public StreetView imagery without reliance on HD maps. Our code will be made publicly available at: https://github.com/iismn/IEEE RA-L LSV-Loc.