Monocular Visual Odometry Via Diffusion-Based Joint Learning of Optical Flow and Depth
Qingyuan Hu, Wei Li, Xuebin Meng, Yu Hu
AI summary
Problem
Monocular visual odometry struggles with scale ambiguity and dynamic object interference, while existing joint learning methods lack absolute scale and rely on difficult-to-obtain dense annotations.
Approach
The authors propose JFD-VO, a framework that jointly trains optical flow and depth networks using a two-stage process with recursive noise diffusion and dynamic masking, enabling scale-aware predictions from sparse LiDAR data and pose ground truth.
Key results
- Reduces absolute trajectory error by 14.99% and 27.37% over KPDepth-VO and DF-VO, respectively
- Predicts dense, scale-aware depth and optical flow using only sparse LiDAR and pose ground truth
- Introduces recursive noise diffusion and dynamic masking to handle sparse annotations and dynamic scenes
- Improves pose estimation via Keypoint-weighted Matching Selection based on forward-backward flow consistency
Why it matters
Provides a robust, annotation-efficient solution for accurate robot and autonomous vehicle localization in real-world dynamic environments.
Abstract
Monocular visual odometry (VO) often suffers from scale ambiguity and interference from moving objects in real-world scenarios. Jointly learning optical flow and depth estimation provides a promising solution for these issues by leveraging their geometric correlation and task complementarity. In this paper, we propose JFD-VO, a novel monocular VO framework that integrates jointly learned optical flow and depth networks. We design a two-stage training process with recursive noise diffusion and a specialized loss function, which enables the model to predict dense and scale-aware depth and optical flow using only readily available sparse LiDAR data and pose ground truth, thereby eliminating the need for expensive and difficult-to-obtain dense annotations. Furthermore, a dedicated masking module is incorporated during joint training to enhance robustness in dynamic environments. Within the VO pipeline, we introduce a Keypoint-weighted Matching Selection module that prioritizes stable features based on forward-backward flow consistency, rather than treating all pixels equally as in conventional optical flow methods. Extensive experiments on public datasets demonstrate the effectiveness of our joint training approach. JFD-VO achieves state-of-the-art accuracy, reducing absolute trajectory error by 14.99% and 27.37% over KPDepth-VO and DF-VO, respectively. Code and our self-collected dataset are available at: https://github.com/huqingyuan-9952/JFD-VO.
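The forward-backward flow consistency check mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. This is a generic version of the check, not the authors' Keypoint-weighted Matching Selection module, whose details the abstract does not give. The function names `fb_errors` and `select_matches`, the nearest-pixel lookup, and the top-`k` selection rule are all assumptions made for illustration. Flows are stored as nested lists where `flow[y][x]` is a `(u, v)` displacement in pixels.

```python
import math


def fb_errors(flow_fw, flow_bw):
    """Per-pixel forward-backward error (hypothetical helper).

    For a pixel p with forward flow f(p), a consistent backward flow at
    p + f(p) should be approximately -f(p), so |f(p) + b(p + f(p))| is
    close to zero for stable, unoccluded pixels.
    """
    h, w = len(flow_fw), len(flow_fw[0])
    errors = [[math.inf] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = flow_fw[y][x]
            # Nearest-pixel lookup; an assumption, bilinear sampling is also common.
            xt, yt = round(x + u), round(y + v)
            if 0 <= xt < w and 0 <= yt < h:
                ub, vb = flow_bw[yt][xt]
                errors[y][x] = math.hypot(u + ub, v + vb)
            # Pixels whose target leaves the image keep an infinite error.
    return errors


def select_matches(flow_fw, flow_bw, k):
    """Return up to k matches with the smallest forward-backward error.

    Each match is ((x, y), (x + u, y + v), error). Taking the k best is
    an illustrative rule, not the selection rule from the paper.
    """
    errors = fb_errors(flow_fw, flow_bw)
    candidates = [
        (errors[y][x], x, y)
        for y in range(len(errors))
        for x in range(len(errors[0]))
        if errors[y][x] < math.inf
    ]
    candidates.sort()
    matches = []
    for err, x, y in candidates[:k]:
        u, v = flow_fw[y][x]
        matches.append(((x, y), (x + u, y + v), err))
    return matches
```

In a VO pipeline, the selected source and target points would then be passed to a pose solver; pixels on moving objects or in occluded regions tend to have large forward-backward error and are dropped by this kind of filter.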