ICRA 2026

Diff-VIO: A Diffusion Model-Based Pose Optimizer for Visual Inertial Odometry

Wenyuan Qin, Xiangxi Kong, Sizhuo Zhang, Hao XU, Xiwang Dong

AI summary

Diff-VIO leverages a conditional diffusion model to iteratively refine coarse pose estimates, significantly boosting accuracy and robustness in visual-inertial odometry.

Visual Inertial Odometry, Diffusion Models, Pose Refinement, Deep Learning, Robot Navigation, Generative AI

Problem

End-to-end VIO methods struggle to explicitly model inherent sensor noise and lack robust prior constraints, limiting localization accuracy and generalization.

Approach

The framework first generates a coarse pose via cross-modal feature fusion, then applies a conditional diffusion model with transformer-based encoders and decoders to iteratively denoise and refine pose residuals.
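The two-stage idea can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the pose is a 6-vector (translation plus small rotation angles) refined by simple addition, the sampler is a standard DDPM-style ancestral sampler with a linear noise schedule, and `predict_noise` is a hypothetical stand-in for the trained conditional transformer decoder.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def refine_pose(coarse_pose, condition, predict_noise, num_steps=50, seed=0):
    """Denoise a pose residual step by step, then add it to the coarse pose.

    predict_noise(x_t, t, condition) is a hypothetical callable standing in
    for a learned conditional noise predictor.
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # Start the reverse process from pure Gaussian noise.
    residual = rng.standard_normal(coarse_pose.shape)

    for t in reversed(range(num_steps)):
        eps_hat = predict_noise(residual, t, condition)
        # Posterior mean of the standard DDPM reverse step.
        mean = (residual - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Inject fresh noise on every step except the last.
            residual = mean + np.sqrt(betas[t]) * rng.standard_normal(coarse_pose.shape)
        else:
            residual = mean

    # Additive update; a real system would compose rotations properly.
    return coarse_pose + residual


def make_oracle(true_residual, alpha_bars):
    """Toy noise predictor that knows the true residual (for demonstration only).

    It ignores the condition argument; a trained network would instead rely on it.
    """
    def predict_noise(x_t, t, condition):
        return (x_t - np.sqrt(alpha_bars[t]) * true_residual) / np.sqrt(1.0 - alpha_bars[t])
    return predict_noise


# Illustrative values only; they are not data from the paper.
coarse = np.array([1.0, 0.0, 0.2, 0.0, 0.01, 0.0])
true_residual = np.array([0.05, -0.02, 0.01, 0.001, -0.002, 0.0005])
_, _, alpha_bars = make_schedule(50)
refined = refine_pose(coarse, condition=None,
                      predict_noise=make_oracle(true_residual, alpha_bars))
```

With the oracle predictor, the last reverse step recovers the true residual up to floating-point error, so `refined` matches `coarse + true_residual`. A learned predictor would only approximate this, and its accuracy would depend on the conditioning features.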

Key results

Surpasses state-of-the-art VIO methods on the KITTI benchmark in accuracy and robustness
Demonstrates strong cross-hardware generalization on Intel RealSense D435i data
Introduces a global-context transformer and conditional decoder for effective long-range dependency modeling
Establishes a novel generative optimization paradigm for learning-based VIO

Why it matters

Advances reliable spatial localization for autonomous systems by replacing traditional discriminative optimization with a robust, noise-aware generative framework.

Abstract

Visual-inertial odometry (VIO) is a cornerstone of environmental perception and spatial localization, with broad applications in autonomous driving, robotic navigation, and embodied intelligence. Although recent deep-learning-based VIO methods have achieved impressive accuracy and computational efficiency, most optimize errors within a maximum a posteriori (MAP) framework and often overlook explicit prior modeling, which limits the achievable performance. To address this challenge, this work introduces Diff-VIO, a VIO optimization framework grounded in diffusion models. An end-to-end coarse pose generator first outputs an initial pose estimate and supplies priors for the diffusion refinement. To constrain the solution space, a diffusion-based refinement module injects these pose priors during generation. This process is supported by a global context transformer encoder and a conditional decoder, which model long-range dependencies and predict residual noise for precise pose refinement. Experiments on the KITTI benchmark demonstrate that the proposed method outperforms state-of-the-art VIO techniques in both accuracy and robustness. Additional evaluations on a dataset collected with an Intel RealSense D435i further validate the generalization capability of the proposed method across hardware platforms. As the first diffusion-based VIO framework, Diff-VIO introduces a novel optimization paradigm for learning-based visual-inertial odometry systems.
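
For reference, the MAP estimate mentioned in the abstract can be written in its textbook form. The symbols are notation assumed here, not taken from the paper: x denotes the pose and z the visual and inertial measurements.

\[
\hat{x}_{\mathrm{MAP}} = \arg\max_{x} \, p(x \mid z) = \arg\max_{x} \, p(z \mid x)\, p(x)
\]

The factor p(x) is the prior. The abstract's argument is that most learned VIO methods leave this term implicit, whereas the diffusion-based refinement stage injects pose priors explicitly during generation.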

Index terms

Visual-Inertial SLAM, Sensor Fusion, Computer Vision for Automation