ICRA 2026

Monocular Visual Odometry Via Diffusion-Based Joint Learning of Optical Flow and Depth

Qingyuan Hu, Wei Li, Xuebin Meng, Yu Hu

AI summary

JFD-VO achieves state-of-the-art monocular visual odometry accuracy by jointly learning optical flow and depth with recursive noise diffusion, resolving scale ambiguity with only sparse LiDAR data and pose ground truth as supervision.

Monocular visual odometry, joint learning, optical flow, depth estimation, diffusion models, sparse LiDAR

Problem

Monocular visual odometry struggles with scale ambiguity and dynamic object interference, while existing joint learning methods lack absolute scale and rely on difficult-to-obtain dense annotations.
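
The scale ambiguity mentioned above can be checked numerically. The following is a minimal sketch, not taken from the paper: the intrinsics `K`, the pose `(R, t)`, the points `X`, and the helper `project` are all made-up illustrative values and names. It shows that scaling the scene and the camera translation by the same factor leaves every pixel unchanged, so images alone cannot fix metric scale.

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (N, 3) into a pinhole camera with pose (R, t)."""
    Xc = X @ R.T + t              # points expressed in the camera frame
    uv = Xc @ K.T                 # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3] # perspective division

# Hypothetical intrinsics, relative pose, and scene points (illustration only).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.5, 0.0, 0.1])
X = np.array([[1.0, 0.5, 8.0],
              [-2.0, 1.0, 12.0],
              [0.3, -0.7, 5.0]])

s = 3.7  # arbitrary global scale factor

pixels = project(K, R, t, X)
pixels_scaled = project(K, R, s * t, s * X)

# The scale factor cancels in the perspective division.
same = np.allclose(pixels, pixels_scaled)
```

Because `same` evaluates to true for any positive `s`, an absolute reference such as LiDAR range measurements is needed to pin down the scale.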

Approach

The authors propose JFD-VO, a framework that jointly trains optical flow and depth networks using a two-stage process with recursive noise diffusion and dynamic masking, enabling scale-aware predictions from sparse LiDAR data and pose ground truth.
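
One common way to supervise a dense depth prediction with sparse LiDAR is to evaluate the loss only at pixels that received a LiDAR return. The sketch below illustrates that general idea under stated assumptions; it is not the paper's specialized loss, which this summary does not describe. The function name `sparse_depth_l1`, the convention that a depth of zero means "no measurement", and the toy arrays are all hypothetical.

```python
import numpy as np

def sparse_depth_l1(pred, lidar, min_depth=1e-3):
    """Mean absolute depth error over pixels that have a LiDAR measurement.

    pred  : (H, W) dense predicted depth in metres
    lidar : (H, W) sparse depth map; entries <= min_depth are treated as empty
    """
    valid = lidar > min_depth
    if not valid.any():
        return 0.0  # no supervision available for this frame
    return float(np.abs(pred[valid] - lidar[valid]).mean())

# Toy example: a constant 10 m prediction and two LiDAR returns.
pred = np.full((4, 6), 10.0)
lidar = np.zeros((4, 6))
lidar[1, 2] = 9.0    # absolute error 1 m
lidar[3, 5] = 12.0   # absolute error 2 m

loss = sparse_depth_l1(pred, lidar)  # mean of 1 m and 2 m
```

Pixels without a return contribute nothing to this loss, so the dense prediction elsewhere has to be constrained by other signals, which is the role the summary attributes to joint training with optical flow and pose ground truth.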

Key results

Reduces absolute trajectory error by 14.99% and 27.37% relative to KPDepth-VO and DF-VO, respectively
Predicts dense, scale-aware depth and optical flow using only sparse LiDAR and pose ground truth
Introduces recursive noise diffusion and dynamic masking to handle sparse annotations and dynamic scenes
Improves pose estimation via Keypoint-weighted Matching Selection based on forward-backward flow consistency
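
Forward-backward flow consistency, on which the last result relies, can be sketched as follows. This is a generic illustration and not the authors' Keypoint-weighted Matching Selection module: the function names, the nearest-neighbour lookup of the backward flow, and the "keep the k most consistent pixels" rule are assumptions made for brevity.

```python
import numpy as np

def fb_consistency_error(flow_fw, flow_bw):
    """Per-pixel forward-backward error for flows of shape (H, W, 2) as (dx, dy)."""
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel lands in the second frame (rounded to the nearest pixel).
    xi = np.rint(xs + flow_fw[..., 0]).astype(int)
    yi = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    # Backward flow sampled at the landing position (clipped to stay indexable).
    bw = flow_bw[np.clip(yi, 0, h - 1), np.clip(xi, 0, w - 1)]
    # A consistent match satisfies flow_fw(p) + flow_bw(p + flow_fw(p)) ~ 0.
    err = np.linalg.norm(flow_fw + bw, axis=-1)
    err[~inside] = np.inf  # pixels that leave the image cannot be verified
    return err

def select_matches(flow_fw, flow_bw, k):
    """Return the k most consistent correspondences (src, dst) and their errors."""
    err = fb_consistency_error(flow_fw, flow_bw)
    flat = err.ravel()
    order = np.argsort(flat)[:k]
    order = order[np.isfinite(flat[order])]
    ys, xs = np.unravel_index(order, err.shape)
    src = np.stack([xs, ys], axis=1).astype(float)
    dst = src + flow_fw[ys, xs]
    return src, dst, flat[order]

# Toy example: uniform 1-pixel shift to the right with one corrupted backward vector.
flow_fw = np.zeros((5, 5, 2)); flow_fw[..., 0] = 1.0
flow_bw = np.zeros((5, 5, 2)); flow_bw[..., 0] = -1.0
flow_bw[2, 3] = (2.0, 0.0)  # breaks consistency for the pixel at x=2, y=2

err = fb_consistency_error(flow_fw, flow_bw)
src, dst, errors = select_matches(flow_fw, flow_bw, k=10)
```

The selected `src` and `dst` pairs could then be passed to a pose solver. A real pipeline would more likely use bilinear sampling of the backward flow and a soft weighting instead of a hard top-k cut.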

Why it matters

Provides a robust, annotation-efficient solution for accurate robot and autonomous vehicle localization in real-world dynamic environments.

Abstract

Monocular visual odometry (VO) often suffers from scale ambiguity and interference from moving objects in real-world scenarios. Jointly learning optical flow and depth estimation provides a promising solution for these issues by leveraging their geometric correlation and task complementarity. In this paper, we propose JFD-VO, a novel monocular VO framework that integrates jointly learned optical flow and depth networks. We design a two-stage training process with recursive noise diffusion and a specialized loss function, which enables the model to predict dense and scale-aware depth and optical flow using only readily available sparse LiDAR data and pose ground truth, thereby eliminating the need for expensive and difficult-to-obtain dense annotations. Furthermore, a dedicated masking module is incorporated during joint training to enhance robustness in dynamic environments. Within the VO pipeline, we introduce a Keypoint-weighted Matching Selection module that prioritizes stable features based on forward-backward flow consistency, rather than treating all pixels equally as in conventional optical flow methods. Extensive experiments on public datasets demonstrate the effectiveness of our joint training approach. JFD-VO achieves state-of-the-art accuracy, reducing absolute trajectory error by 14.99% and 27.37% over KPDepth-VO and DF-VO. Code and our self-collected dataset are available at: https://github.com/huqingyuan-9952/JFD-VO.
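
For readers unfamiliar with the metric, a percentage reduction in absolute trajectory error (ATE) is typically computed as sketched below. This is an illustration only: the trajectories are synthetic, the function names are hypothetical, and the estimated poses are assumed to be already aligned to the ground-truth frame. The paper's exact evaluation protocol is not stated on this page.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Root-mean-square position error between aligned (N, 3) trajectories."""
    diff = est_xyz - gt_xyz
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def relative_reduction(baseline_err, new_err):
    """Percentage by which new_err is lower than baseline_err."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# Synthetic trajectories (not results from the paper).
gt = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [2.0, 0.0, 0.0],
               [3.0, 0.0, 0.0]])
baseline = gt + np.array([0.0, 0.4, 0.0])  # constant 0.4 m lateral offset
improved = gt + np.array([0.0, 0.3, 0.0])  # constant 0.3 m lateral offset

e_base = ate_rmse(baseline, gt)
e_new = ate_rmse(improved, gt)
gain = relative_reduction(e_base, e_new)
```

In this toy case the error drops from about 0.4 m to about 0.3 m, a reduction of about 25%. The reported 14.99% and 27.37% figures are relative reductions of this kind.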

Index terms

Localization, SLAM, Visual Learning