ICRA 2026

Manydepth2: Motion-Aware Self-Supervised Monocular Depth Estimation in Dynamic Scenes

Kaichen Zhou, Jiawang Bian, Jian-Qing Zheng, Jia-Xing Zhong, Qian Xie, Andrew Markham, Niki Trigoni

AI summary

Manydepth2 significantly improves self-supervised monocular depth estimation accuracy in dynamic scenes by integrating optical flow to build a motion-aware cost volume and an attention-based depth network.

Self-supervised depth estimation, Monocular depth estimation, Motion-aware cost volume, Optical flow, Dynamic scenes, Attention network

Problem

Self-supervised monocular depth estimation typically assumes a static world, so it struggles in dynamic scenes and falls behind stereo methods wherever objects move.

Approach

The method uses a pre-trained optical flow model to distinguish dynamic from static regions, constructs a pseudo-static reference frame, and builds a motion-aware cost volume combined with an attention-based depth network that fuses multi-resolution features.
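The summary above does not spell out how flow separates dynamic from static regions. One common way to do it, shown here only as a minimal sketch and not as the paper's formulation, is to compare the observed optical flow with the "rigid" flow that camera motion alone would produce in a static scene, and to flag pixels where the two disagree. The function names `rigid_flow` and `dynamic_mask`, the pose convention, and the 1-pixel threshold are all assumptions made for illustration.

```python
import numpy as np

def rigid_flow(depth, K, T):
    """Flow induced by camera motion alone, assuming a static scene.

    depth: (H, W) depth of the target frame
    K:     (3, 3) camera intrinsics
    T:     (4, 4) pose mapping target-camera points into the source camera
             (assumed convention, not taken from the paper)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    # Back-project every pixel to a 3D point in the target camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the source camera frame and project them.
    pts_src = T[:3, :3] @ pts + T[:3, 3:4]
    proj = K @ pts_src
    uv_src = proj[:2] / np.clip(proj[2:3], 1e-6, None)
    return (uv_src - pix[:2]).reshape(2, h, w)

def dynamic_mask(optical_flow, depth, K, T, threshold=1.0):
    """True where observed flow deviates from rigid flow by > threshold pixels."""
    residual = optical_flow - rigid_flow(depth, K, T)
    return np.linalg.norm(residual, axis=0) > threshold

# Toy scene: flat depth, camera translating along x, one patch moving on its own.
H, W = 4, 6
depth = np.full((H, W), 10.0)
K = np.array([[100.0, 0.0, 3.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.5
flow = rigid_flow(depth, K, T)          # what a static world would look like
flow[:, 1:3, 2:4] += np.array([3.0, 0.0]).reshape(2, 1, 1)  # independent motion
mask = dynamic_mask(flow, depth, K, T)
```

In the toy scene only the 2x2 patch with extra motion is flagged. In practice the depth and pose would themselves be network predictions, so the threshold has to absorb their errors as well.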

Key results

~5% RMSE reduction on KITTI-2015
Outperforms ManyDepth and DynamicDepth on KITTI and Cityscapes
Accurate depth estimation for dynamic foreground and static background
Efficient single-GPU training with minimal overhead
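For readers unfamiliar with the headline metric, the sketch below shows how RMSE is usually computed for depth and what a relative reduction of about 5% means. It assumes missing ground truth is encoded as 0, as is typical for sparse LiDAR depth maps. The helper names and all numbers are made up for illustration and are not results from the paper.

```python
import math

def rmse(pred, gt):
    """Root-mean-square error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))

def relative_reduction(baseline, improved):
    """Fractional drop in error when moving from baseline to improved."""
    return (baseline - improved) / baseline

# Toy depths in metres; the third pixel has no ground truth and is skipped.
gt = [10.0, 20.0, 0.0]
pred = [11.0, 27.0, 5.0]
error = rmse(pred, gt)

# Hypothetical RMSE values, chosen only to illustrate a 5% relative reduction.
drop = relative_reduction(4.0, 3.8)
```

Because errors are squared before averaging, RMSE is dominated by large mistakes, which is where mis-estimated moving objects tend to show up.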

Why it matters

Enables robust, accurate 3D scene understanding for autonomous driving and robotics in real-world environments where objects are constantly moving.

Abstract

Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, to achieve precise depth estimation for both dynamic objects and static backgrounds, all while maintaining computational efficiency. To address the challenges introduced by dynamic content, we incorporate optical flow into monocular depth estimation, allowing our model to distinguish between dynamic and static regions in multi-frame inputs. We then construct a motion-aware cost volume across multiple frames by incorporating dynamic region information, which is used for accurate depth estimation. Furthermore, to improve the accuracy and robustness of the network architecture, we propose an attention-based depth network that effectively integrates information from feature maps at different resolutions by incorporating both channel and non-local attention mechanisms. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset.

Index terms

SLAM, Visual-Inertial SLAM, Visual Learning