Manydepth2: Motion-Aware Self-Supervised Monocular Depth Estimation in Dynamic Scenes
Kaichen Zhou, Jiawang Bian, Jian-Qing Zheng, Jia-Xing Zhong, Qian Xie, Andrew Markham, Niki Trigoni
AI summary
Problem
Self-supervised monocular depth estimation struggles in dynamic scenes because it typically assumes a static world, causing performance gaps compared to stereo methods when handling moving objects.
Approach
The method uses a pre-trained optical flow model to distinguish dynamic from static regions, constructs a pseudo-static reference frame, and builds a motion-aware cost volume combined with an attention-based depth network that fuses multi-resolution features.
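The summary does not spell out how optical flow separates dynamic from static regions. One common way to do it, sketched here as an assumption rather than as the paper's procedure, is to compare the observed flow at a pixel with the "rigid" flow that a static point would produce under the estimated depth and camera motion. The function names, the pinhole intrinsics tuple `K = (fx, fy, cx, cy)`, and the 1-pixel threshold are all hypothetical.

```python
import math

def rigid_flow(u, v, depth, K, R, t):
    """Flow a static 3D point seen at pixel (u, v) would show under camera motion (R, t)."""
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the first camera's frame.
    X = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Express the point in the second camera's frame: X' = R X + t.
    Xp = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Project back to pixel coordinates with the same intrinsics.
    u2 = fx * Xp[0] / Xp[2] + cx
    v2 = fy * Xp[1] / Xp[2] + cy
    return (u2 - u, v2 - v)

def is_dynamic(observed_flow, u, v, depth, K, R, t, threshold=1.0):
    """Flag a pixel as dynamic when observed flow departs from the rigid flow.

    The threshold (in pixels) is an illustrative assumption, not a value from the paper.
    """
    rf = rigid_flow(u, v, depth, K, R, t)
    residual = math.hypot(observed_flow[0] - rf[0], observed_flow[1] - rf[1])
    return residual > threshold
```

A pixel whose observed flow matches the rigid flow is consistent with a static world. A large residual suggests the point moved on its own, which is the case the static-world assumption cannot explain.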
Key results
- ~5% RMSE reduction on KITTI-2015 relative to methods of similar computational cost
- Outperforms ManyDepth and DynamicDepth on KITTI and Cityscapes
- Accurate depth estimation for dynamic foreground and static background
- Efficient single-GPU training with minimal overhead
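For readers unfamiliar with the metric behind the headline figure, the following is a minimal sketch of how root-mean-square error and a relative reduction could be computed. Treating ground-truth values of zero as missing is an assumption about the data layout, and the helper names are hypothetical. Any numbers passed to these functions in practice would come from your own evaluation; none are taken from the paper.

```python
import math

def rmse(pred, gt):
    """Root-mean-square error over pixels with valid ground truth.

    Assumes gt <= 0 marks a pixel with no ground-truth depth.
    """
    squared = [(p - g) ** 2 for p, g in zip(pred, gt) if g > 0]
    if not squared:
        raise ValueError("no valid ground-truth pixels")
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline, improved):
    """Fractional improvement of `improved` over `baseline` (0.05 means 5%)."""
    return (baseline - improved) / baseline
```

Because errors are squared before averaging, RMSE weights large depth errors heavily, so mistakes on a few moving objects can noticeably affect the score.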
Why it matters
Enables robust, accurate 3D scene understanding for autonomous driving and robotics in real-world environments where objects are constantly moving.
Abstract
Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, which aims to achieve precise depth estimation for both dynamic objects and static backgrounds, all while maintaining computational efficiency. To address the challenges introduced by dynamic content, we incorporate optical flow into monocular depth estimation, allowing our model to distinguish between dynamic and static regions in multi-frame inputs. We then construct a motion-aware cost volume across multiple frames by incorporating dynamic region information, which is used for accurate depth estimation. Furthermore, to improve the accuracy and robustness of the network architecture, we propose an attention-based depth network that effectively integrates information from feature maps at different resolutions by incorporating both channel and non-local attention mechanisms. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset.
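To give a feel for what a cost volume is, here is a deliberately simplified one-dimensional toy. It is not the paper's construction: real multi-frame methods sweep over depth hypotheses and warp feature maps using the camera pose, whereas this sketch sweeps over integer pixel shifts along a single scanline. Masking out pixels flagged as dynamic is likewise an illustrative stand-in for the motion-aware handling described above, and all names are hypothetical.

```python
def cost_volume_1d(ref, src, max_shift):
    """cost[d][x] = |ref[x] - src[x - d]| for each candidate shift d.

    Positions where x - d falls outside the scanline get an infinite cost.
    """
    n = len(ref)
    return [
        [abs(ref[x] - src[x - d]) if 0 <= x - d < n else float("inf")
         for x in range(n)]
        for d in range(max_shift + 1)
    ]

def best_shift(cost, static_mask):
    """Pick the lowest-cost shift per pixel; return None where the pixel is flagged dynamic."""
    result = []
    for x in range(len(cost[0])):
        if not static_mask[x]:
            # Matching costs are unreliable for independently moving content.
            result.append(None)
            continue
        column = [cost[d][x] for d in range(len(cost))]
        result.append(column.index(min(column)))
    return result
```

The lowest-cost candidate only corresponds to the true geometry when the matched content is static, which is why dynamic regions need separate treatment before or during cost-volume construction.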