ICRA 2026

StereoMamba: Real-Time and Robust Intraoperative Stereo Disparity Estimation via Long-Range Spatial Dependencies

Xu Wang, Jialang Xu, Shuai Zhang, Baoru Huang, Danail Stoyanov, Evangelos Mazomenos

AI summary

StereoMamba achieves an optimal balance of accuracy, robustness, and real-time speed for intraoperative stereo disparity estimation in robotic surgery.

Stereo disparity estimation, robotic surgery, Mamba architecture, real-time inference, cost volume, zero-shot generalization

Problem

Current deep learning methods for stereo disparity estimation in robotic-assisted minimally invasive surgery struggle to balance accuracy, robustness, and inference speed, often limited by CNN receptive fields or the high computational cost of Transformers.

Approach

The authors propose StereoMamba, which uses a Feature Extraction Mamba module to capture long-range spatial dependencies within and across stereo images, combined with a Multidimensional Feature Fusion module to efficiently integrate multi-scale features for cost volume construction.
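The summary above does not say how StereoMamba builds its cost volume, so the following is only a generic illustration of what "cost volume construction" means in stereo matching. It is a minimal NumPy sketch of a correlation cost volume, assuming feature maps of shape (C, H, W) and a hypothetical function name `correlation_cost_volume`; it does not reproduce the FE-Mamba or MFF modules.

```python
import numpy as np

def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Generic correlation cost volume (not StereoMamba's exact construction).

    left_feat, right_feat: feature maps of shape (C, H, W).
    Returns an array of shape (max_disp, H, W); entry [d, y, x] is the
    channel-averaged product of left pixel x and right pixel x - d.
    """
    _, h, w = left_feat.shape
    volume = np.zeros((max_disp, h, w), dtype=np.float32)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (left_feat * right_feat).mean(axis=0)
        else:
            # Columns x < d have no counterpart in the right image and stay 0.
            volume[d, :, d:] = (left_feat[:, :, d:] * right_feat[:, :, :-d]).mean(axis=0)
    return volume

# Toy check with one-hot "features": each right-image column has a unique code,
# and the left image is the right image shifted by 3 pixels.
w = 16
right = np.repeat(np.eye(w)[:, None, :], 2, axis=1)  # shape (C=16, H=2, W=16)
left = np.zeros_like(right)
left[:, :, 3:] = right[:, :, :-3]
vol = correlation_cost_volume(left, right, max_disp=8)
best_disp = vol.argmax(axis=0)  # per-pixel disparity with the highest correlation
```

In this toy case the highest correlation for every column with a valid match (x ≥ 3) falls at disparity 3. A learned network would typically regress a sub-pixel disparity from such a volume rather than take a hard argmax.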

Key results

State-of-the-art EPE of 2.64 px and depth MAE of 2.55 mm on the ex-vivo SCARED benchmark
Real-time inference speed of 21.28 FPS for 1280×1024 image pairs
Strong zero-shot generalization, with the best average SSIM (0.8970) and PSNR (16.08) on the in-vivo RIS2017 and StereoMIS datasets
Second-best Bad2 (41.49%) and Bad3 (26.99%) error rates on SCARED
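For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed. It assumes the usual definitions, which this page does not spell out: EPE is the mean absolute disparity error in pixels, Bad-N is the percentage of valid pixels whose error exceeds N pixels, and depth follows the pinhole relation depth = f·B/d. The function names, the validity mask, and the camera parameters are illustrative, not taken from the paper.

```python
import numpy as np

def disparity_metrics(pred, gt, valid=None):
    """EPE (px) and Bad-N rates (%) over valid ground-truth pixels."""
    if valid is None:
        # Assumption: missing ground truth is encoded as 0 or NaN.
        valid = np.isfinite(gt) & (gt > 0)
    err = np.abs(pred - gt)[valid]
    return {
        "EPE": float(err.mean()),
        "Bad2": float((err > 2.0).mean() * 100.0),
        "Bad3": float((err > 3.0).mean() * 100.0),
    }

def disparity_to_depth(disp, focal_px, baseline_mm):
    """Pinhole stereo relation depth = f * B / d; non-positive disparity gives NaN."""
    disp = np.asarray(disp, dtype=np.float64)
    depth = np.full(disp.shape, np.nan)
    np.divide(focal_px * baseline_mm, disp, out=depth, where=disp > 0)
    return depth

gt = np.array([[10.0, 20.0], [30.0, 0.0]])    # 0 marks a pixel without ground truth
pred = np.array([[10.5, 22.5], [34.0, 5.0]])
metrics = disparity_metrics(pred, gt)          # errors on valid pixels: 0.5, 2.5, 4.0
depth_mm = disparity_to_depth(gt, focal_px=1000.0, baseline_mm=4.0)
```

With the three valid pixels above, EPE is 7/3 ≈ 2.33 px, Bad2 is about 66.7%, and Bad3 is about 33.3%. A depth MAE would then be the mean absolute difference between predicted and ground-truth depth maps over the same valid pixels.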

Why it matters

Enables reliable, real-time depth perception for surgeons during robotic-assisted minimally invasive procedures, improving navigation precision and procedural safety.

Abstract

Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280 × 1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.

Received 16 April 2025; accepted 13 August 2025. Date of publication 1 September 2025; date of current version 10 September 2025. This article was recommended for publication by Associate Editor A. Kuntz and Editor J. Burgner-Kahrs upon evaluation of the reviewers’ comments. This work was supported in part by EPSRC through the UCL Centre for Doctoral Training in Intelligent, Integrated Imaging in Healthcare (i4health) under Grant EP/S021930/1, in part by Human-centric Machine Intelligence to optimise Robotic Surgical Training under Grant EP/Z534754/1, in part by the Optical and Acoustic imaging for Surgical and Interventional Sciences under Grant UKRI145 projects, in part by UCL Research Excellence Scholarships Programme, in part by NIHR UCLH Biomedical Research Centre under Grant NIHR203328, and in part by the Department of Science, Innovation and Technology (DSIT) and the Royal Academy of Engineering through the Chair in Emerging Technologies programme. (Corresponding authors: Xu Wang; Evangelos B. Mazomenos.)

Xu Wang, Jialang Xu, and Evangelos B. Mazomenos are with UCL Hawkes Institute and the Department of Medical Physics and Biomedical Engineering, University College London, W1W 7TY London, U.K. (e-mail: xu.wang.23@ucl.ac.uk; jialang.xu.22@ucl.ac.uk; e.mazomenos@ucl.ac.uk). Shuai Zhang and Danail Stoyanov are with UCL Hawkes Institute and the Department of Computer Science, University College London, W1W 7TY London, U.K. (e-mail: shuai.z@ucl.ac.uk; danail.stoyanov@ucl.ac.uk). Baoru Huang is with the Department of Computer Science, University of Liverpool, L69 7ZX Liverpool, U.K. (e-mail: Baoru.Huang@liverpool.ac.uk).

Code is available at: https://github.com/MichaelWangGo/StereoMamba.git

This article has supplementary downloadable material available at https://doi.org/10.1109/LRA.2025.3604749, provided by the authors. Digital Object Identifier 10.1109/LRA.2025.3604749
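The zero-shot evaluation described in the abstract (warp the left image with the predicted disparity, then compare the result with the real right image) can be illustrated roughly as follows. This sketch makes several assumptions not stated in the source: images are single-channel, the disparity map is expressed on the right view's pixel grid so that each right pixel x samples the left image at x + d, sampling uses linear interpolation along the row, and PSNR uses a data range of 255. The function names are hypothetical.

```python
import numpy as np

def warp_left_to_right(left, disp_right):
    """Synthesize the right view by sampling left[y, x + d(y, x)].

    left: (H, W) grayscale image.
    disp_right: (H, W) disparity on the right view's pixel grid
    (an assumption of this sketch, not a detail from the paper).
    """
    h, w = left.shape
    xs = np.arange(w, dtype=np.float64)[None, :] + disp_right
    xs = np.clip(xs, 0, w - 1)            # out-of-image samples reuse the border pixel
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * left[rows, x0] + frac * left[rows, x1]

def psnr(a, b, data_range=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(data_range ** 2 / mse))

# Toy example: a horizontal intensity ramp and a constant disparity of 2 px.
left = np.tile(10.0 * np.arange(8), (4, 1))
synth_right = warp_left_to_right(left, np.full((4, 8), 2.0))
score = psnr(synth_right[:, :6], left[:, 2:])  # compare only columns with a valid match
```

SSIM can be computed on the same image pair with an existing implementation such as `skimage.metrics.structural_similarity`. Because no ground-truth disparity is needed, this kind of photometric comparison suits in-vivo data where depth annotations are unavailable.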

Index terms

Deep Learning for Visual Perception; Computer Vision for Medical Robotics; Computer Vision for Automation