ICRA 2026

StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes

Zhengri Wu, Yiran Wang, Yu Wen, Zeyu Zhang, Biao Wu, Hao Tang

AI summary

StereoAdapter achieves state-of-the-art underwater stereo depth estimation by efficiently adapting monocular foundation models with LoRA and refining them via self-supervised recurrent stereo matching.

underwater depth estimation, stereo matching, LoRA adaptation, self-supervised learning, foundation models, robotics

Problem

Underwater stereo depth estimation suffers from severe domain shifts and data scarcity, making it difficult to adapt large vision foundation models or fuse monocular priors with fragile stereo correspondences without extensive labeled data.

Approach

The framework uses a LoRA-adapted monocular foundation encoder to generate coarse depth priors, which guide a recurrent GRU-based stereo refinement module trained entirely without dense labels.
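
To make the LoRA part of this concrete, here is a minimal sketch of a generic low-rank adapted linear layer: the pretrained weight stays frozen and only two small matrices are trained. This is plain LoRA, not the paper's dynamic rank-selection variant, and the class name `LoRALinear`, the `alpha / rank` scaling, and the initialization values are illustrative assumptions rather than details taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical illustration: frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # A starts with small random values and B with zeros, so the adapted
        # layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank  # assumed scaling convention

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B are counted; the frozen weight is excluded.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a layer with `d_in` inputs and `d_out` outputs, the trainable parameter count is `rank * (d_in + d_out)` instead of `d_in * d_out`, which is where the parameter efficiency comes from when the rank is small.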

Key results

State-of-the-art zero-shot RMSE of 2.8947 on TartanAir underwater subset
RMSE of 1.8843 on SQUID dataset with improved threshold accuracy
Dynamic LoRA strategy for efficient rank selection and adaptation
UW-StereoDepth-40K synthetic dataset and validated BlueROV2 deployment
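
The two metric families named above can be sketched as follows. This is a minimal illustration assuming the common depth-evaluation conventions (pixels with non-positive ground truth are masked out, and threshold accuracy uses a ratio bound of 1.25); the paper's exact evaluation protocol, such as depth caps or masking rules, is not stated here, and the function names are hypothetical.

```python
import math


def rmse(pred, gt):
    """Root-mean-square error over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return math.sqrt(sum((p - g) ** 2 for p, g in valid) / len(valid))


def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) is below the threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)


# Toy depths in meters; the last pixel has no ground truth and is ignored.
pred = [1.0, 2.0, 4.0, 3.0]
gt = [1.0, 2.0, 2.0, 0.0]
print(rmse(pred, gt))
print(threshold_accuracy(pred, gt))
```

Lower RMSE is better, while higher threshold accuracy is better, so the two numbers should be read in opposite directions when comparing methods.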

Why it matters

Provides a scalable, label-free solution for accurate 3D perception, directly advancing autonomy and safety for underwater robotics and ROV operations.

Abstract

Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping, offering metric depth from low-cost passive cameras while avoiding the scale ambiguity of monocular methods. However, existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. To address these challenges, we propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module. We further introduce dynamic LoRA adaptation for efficient rank selection and pre-training on the synthetic UW-StereoDepth-40K dataset to enhance robustness under diverse underwater conditions. Comprehensive evaluations on both simulated and real-world benchmarks show improvements of 6.11% on TartanAir and 5.12% on SQUID compared to state-of-the-art methods, while real-world deployment with the BlueROV2 robot further demonstrates the consistent robustness of our approach.
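
As background on why stereo yields metric depth where monocular methods do not, the sketch below applies the standard rectified-stereo relation, depth = focal length × baseline / disparity. It is a generic illustration, not code from the paper; the function name and the camera values (800 px focal length, 0.1 m baseline) are made up for the example.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity in pixels to metric depth for a rectified stereo pair.

    The known baseline (in meters) fixes the scale, which is why stereo depth
    is metric. Near-zero disparity corresponds to points at infinity.
    """
    if disparity_px <= min_disp:
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 800 px focal length, 0.1 m baseline.
for d in (40.0, 20.0, 10.0):
    print(d, disparity_to_depth(d, focal_px=800.0, baseline_m=0.1))
```

Because depth is inversely proportional to disparity, a fixed matching error in pixels causes a larger depth error for distant points, which is one reason fragile correspondences are costly in low-visibility water.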

Index terms

Deep Learning for Visual Perception, Transfer Learning, Deep Learning Methods