StereoMamba: Real-Time and Robust Intraoperative Stereo Disparity Estimation via Long-Range Spatial Dependencies
Xu Wang, Jialang Xu, Shuai Zhang, Baoru Huang, Danail Stoyanov, Evangelos Mazomenos
AI summary
Problem
Current deep learning methods for stereo disparity estimation in robotic-assisted minimally invasive surgery struggle to balance accuracy, robustness, and inference speed, often limited by CNN receptive fields or the high computational cost of Transformers.
Approach
The authors propose StereoMamba, which uses a Feature Extraction Mamba module to capture long-range spatial dependencies within and across stereo images, combined with a Multidimensional Feature Fusion module to efficiently integrate multi-scale features for cost volume construction.
Key results
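- Best EPE (2.64 px) and depth MAE (2.55 mm) among compared methods on the ex-vivo SCARED benchmark
- Inference speed of 21.28 FPS for a pair of 1280×1024 images
- Strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets, with the best average SSIM (0.8970) and PSNR (16.08) between right images synthesized by warping the left image and the actual right images
- Second-best Bad2 (41.49%) and Bad3 (26.99%) error rates on SCARED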
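The disparity metrics in the list above can be illustrated with a minimal NumPy sketch. It assumes the common definitions: EPE is the mean absolute disparity error in pixels, and Bad-k is the percentage of valid pixels whose error exceeds k pixels. The function names, the validity rule (ground truth finite and greater than zero), and the toy arrays are illustrative assumptions, not taken from the StereoMamba code, and the paper's exact masking may differ.

```python
import numpy as np


def disparity_errors(pred, gt, valid=None):
    """Absolute per-pixel disparity error over valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        # Assumption: pixels without ground truth are stored as 0 or NaN.
        valid = np.isfinite(gt) & (gt > 0)
    return np.abs(pred - gt)[valid]


def epe(pred, gt, valid=None):
    """End-point error: mean absolute disparity error in pixels."""
    return float(disparity_errors(pred, gt, valid).mean())


def bad_k(pred, gt, k, valid=None):
    """Percentage of valid pixels whose disparity error exceeds k pixels."""
    err = disparity_errors(pred, gt, valid)
    return float(100.0 * (err > k).mean())


# Toy 2x2 example; the bottom-right pixel has no ground truth and is ignored.
gt = np.array([[10.0, 20.0], [30.0, 0.0]])
pred = np.array([[10.5, 23.0], [26.0, 5.0]])

print(epe(pred, gt))       # mean of the errors 0.5, 3.0, 4.0
print(bad_k(pred, gt, 2))  # share of errors above 2 px, in percent
print(bad_k(pred, gt, 3))  # share of errors above 3 px, in percent
```

With these definitions, lower is better for all three numbers, which is why Bad2 and Bad3 are reported as error rates.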
Why it matters
Enables reliable, real-time depth perception for surgeons during robotic-assisted minimally invasive procedures, improving navigation precision and procedural safety.
Abstract
Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280 × 1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.

Received 16 April 2025; accepted 13 August 2025. Date of publication 1 September 2025; date of current version 10 September 2025. This article was recommended for publication by Associate Editor A. Kuntz and Editor J. Burgner-Kahrs upon evaluation of the reviewers’ comments.
This work was supported in part by EPSRC through the UCL Centre for Doctoral Training in Intelligent, Integrated Imaging in Healthcare (i4health) under Grant EP/S021930/1, in part by Human-centric Machine Intelligence to optimise Robotic Surgical Training under Grant EP/Z534754/1, in part by the Optical and Acoustic imaging for Surgical and Interventional Sciences under Grant UKRI145 projects, in part by UCL Research Excellence Scholarships Programme, in part by NIHR UCLH Biomedical Research Centre under Grant NIHR203328, and in part by the Department of Science, Innovation and Technology (DSIT) and the Royal Academy of Engineering through the Chair in Emerging Technologies programme. (Corresponding authors: Xu Wang; Evangelos B. Mazomenos.)

Xu Wang, Jialang Xu, and Evangelos B. Mazomenos are with UCL Hawkes Institute and the Department of Medical Physics and Biomedical Engineering, University College London, W1W 7TY London, U.K. (e-mail: xu.wang.23@ucl.ac.uk; jialang.xu.22@ucl.ac.uk; e.mazomenos@ucl.ac.uk).

Shuai Zhang and Danail Stoyanov are with UCL Hawkes Institute and the Department of Computer Science, University College London, W1W 7TY London, U.K. (e-mail: shuai.z@ucl.ac.uk; danail.stoyanov@ucl.ac.uk).

Baoru Huang is with the Department of Computer Science, University of Liverpool, L69 7ZX Liverpool, U.K. (e-mail: Baoru.Huang@liverpool.ac.uk).

Code is available at: https://github.com/MichaelWangGo/StereoMamba.git

This article has supplementary downloadable material available at https://doi.org/10.1109/LRA.2025.3604749, provided by the authors.

Digital Object Identifier 10.1109/LRA.2025.3604749
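The zero-shot evaluation described in the abstract compares a right image synthesized from the left image and the predicted disparity against the actual right image. The following is a minimal sketch of that idea, not the authors' evaluation code. It assumes a left-referenced disparity map on rectified images, so that left pixel (x, y) corresponds to right pixel (x - d, y); nearest-pixel forward warping; an 8-bit intensity peak of 255; and PSNR computed only over pixels that received a value. The function names are hypothetical, and SSIM is omitted for brevity.

```python
import numpy as np


def forward_warp_left_to_right(left, disp_left):
    """Synthesize the right view by moving each left pixel (x, y) to (x - d, y).

    Returns the synthesized image and a boolean mask of pixels that were filled.
    """
    h, w = disp_left.shape
    synth = np.zeros_like(left, dtype=np.float64)
    filled = np.zeros((h, w), dtype=bool)

    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs - disp_left).astype(int)   # target column in the right view
    inside = (xt >= 0) & (xt < w)

    # Write pixels in order of increasing disparity so that nearer surfaces
    # (larger disparity) overwrite farther ones; NumPy keeps the last write
    # when an index is repeated in an indexed assignment.
    order = np.argsort(disp_left[inside], kind="stable")
    ys_i = ys[inside][order]
    xs_i = xs[inside][order]
    xt_i = xt[inside][order]

    synth[ys_i, xt_i] = left[ys_i, xs_i]
    filled[ys_i, xt_i] = True
    return synth, filled


def psnr(a, b, mask, peak=255.0):
    """Peak signal-to-noise ratio in dB over the pixels selected by mask."""
    diff = (np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64))[mask]
    mse = float(np.mean(diff ** 2))
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))


# Synthetic check: a scene at constant disparity d, so the left image is the
# right image shifted by d columns and the warp should recover it exactly.
rng = np.random.default_rng(0)
right = rng.integers(0, 256, size=(4, 8)).astype(np.float64)
d = 2
left = np.zeros_like(right)
left[:, d:] = right[:, :-d]
disp = np.full(right.shape, float(d))

synth, filled = forward_warp_left_to_right(left, disp)
print(psnr(synth, right, filled))
```

On real predictions the warped image never matches exactly, so the PSNR is finite; higher values indicate that the disparity map explains the right view better. Because this measure needs no ground-truth depth, it can be applied to in-vivo data where none is available.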