ICRA 2026

DiffVL: Diffusion-Based Visual Localization on 2D Maps Via BEV-Conditioned GPS Denoising

Li Gao, Hongyang Sun, Liu Liu, Yunhao Li, Yang Cai

AI summary

DiffVL reframes visual localization as a GPS denoising task using diffusion models, achieving sub-meter accuracy on low-cost SD maps without requiring HD maps.

Visual localization, Diffusion models, Standard-definition maps, GPS denoising, Bird's-Eye View, Autonomous driving

Problem

Current SD-map-based localization methods ignore ubiquitous noisy GPS data and rely on error-prone BEV matching, while costly HD maps limit scalability.

Approach

The method conditions a diffusion model on BEV features from a single image and SD map data to iteratively denoise raw GPS trajectories into precise poses.
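
The iterative denoising described above can be illustrated with a minimal sketch of a standard DDPM-style reverse loop over a 2D position. This is not the paper's implementation: the linear noise schedule, the step count, and the names `make_schedule`, `oracle_noise_predictor`, `denoise_gps`, and `pose_hint` are all assumptions made for illustration. In particular, the noise predictor here is an oracle that reads the true pose from the conditioning dictionary; in a real system a trained network conditioned on BEV and SD map features would take its place.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule (an assumed choice, not taken from the paper).
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def oracle_noise_predictor(x_t, t, cond, alpha_bars):
    # Hypothetical stand-in for a learned network eps_theta(x_t, t, cond).
    # It cheats by reading the true pose from cond["pose_hint"]; a real model
    # would infer the noise from BEV image features and SD map features.
    x0 = cond["pose_hint"]
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

def denoise_gps(noisy_gps, cond, num_steps=50, seed=0):
    # Reverse diffusion: start from the noisy GPS fix and step toward the pose.
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = np.asarray(noisy_gps, dtype=float)
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise_predictor(x, t, cond, alpha_bars)
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is injected at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

true_pose = np.array([12.0, -3.5])   # metres in a local map frame (made-up values)
noisy_fix = np.array([15.2, -7.1])   # made-up noisy GPS reading
refined = denoise_gps(noisy_fix, {"pose_hint": true_pose})
```

With the oracle predictor the loop recovers `true_pose` up to floating-point error, which only confirms that the update rule is wired correctly; it says nothing about the accuracy a learned model would reach.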

Key results

First diffusion-based framework to denoise noisy GPS for visual localization
Achieves sub-meter accuracy on standard-definition maps without HD map dependency
Outperforms state-of-the-art BEV-matching baselines across KITTI, nuScenes, and MGL datasets
Introduces a dual-objective loss balancing trajectory refinement and geometric BEV-map regularization
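
The summary does not give the form of the dual-objective loss, so the following is only a guess at its general shape: a weighted sum of a denoising term and a BEV-to-map alignment term. The choice of mean squared error on the predicted noise, the cosine-distance regularizer, the `weight` value, and all function names are assumptions, not details from the paper.

```python
import numpy as np

def denoising_loss(eps_pred, eps_true):
    # Assumed trajectory-refinement term: MSE between predicted and true noise.
    return float(np.mean((eps_pred - eps_true) ** 2))

def bev_map_alignment_loss(bev_feat, map_feat):
    # Assumed geometric regularizer: 1 - cosine similarity of flattened features.
    a = bev_feat.ravel()
    b = map_feat.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return float(1.0 - cos)

def dual_objective_loss(eps_pred, eps_true, bev_feat, map_feat, weight=0.1):
    # Weighted sum of both terms; `weight` is a hypothetical balancing factor.
    return denoising_loss(eps_pred, eps_true) + weight * bev_map_alignment_loss(bev_feat, map_feat)
```

A larger `weight` would push the BEV features to agree with the map at the expense of the denoising term; how the authors actually balance the two objectives is not stated in this summary.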

Why it matters

Provides a scalable, low-cost alternative to HD maps for precise autonomous navigation by leveraging ubiquitous noisy GPS and generative AI.

Abstract

Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: while high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps such as OpenStreetMap. Current SD-map-based approaches primarily focus on Bird’s-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal: noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that noisy GPS trajectories, when conditioned on visual BEV features and SD maps, implicitly encode the true pose distribution, which can be recovered through iterative diffusion refinement. Unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, DiffVL learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work shows that diffusion models can enable scalable localization by treating noisy GPS as a generative prior, marking a paradigm shift from traditional matching-based methods. Code and models will be open-sourced.

Index terms

Localization, Deep Learning for Visual Perception, Computer Vision for Automation