ICRA 2026

CaLoRA-Stereo: Robust Stereo Endoscopic Depth Estimation Network Via Camera-Aware LoRA and Dual-View Geometry

Shixing Ma, Shuwei Shao, Zhaoxi Lin, Xinzhe Du, Rui Song, Yibin Li, Max Q.-H. Meng, Zhe Min

AI summary

CaLoRA-Stereo adapts frozen stereo foundation models to surgical endoscopy using camera-aware LoRA and dual-view geometric constraints, achieving state-of-the-art depth accuracy with minimal trainable parameters.

Stereo depth estimation; Endoscopic surgery; LoRA adaptation; Camera-aware scaling; Geometric consistency; Surgical foundation models

Problem

Pretrained stereo foundation models degrade in minimally invasive surgery because of domain shifts such as specular highlights, low-texture tissue, and intraoperative drift in camera intrinsics.

Approach

The method injects a camera-specific scaling gate into LoRA modules to adapt updates based on focal length and baseline, while enforcing cross-view depth consistency and spectral structural alignment during training.
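
A minimal sketch of how a camera-dependent gate could scale a LoRA update is given below. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the gate's form, so `camera_gate` (a sigmoid over log-ratios of focal length and baseline to a reference camera), its weights, and the reference values are all hypothetical. Only the frozen weight `W` plus a low-rank update `B(Ax)` scaled by `alpha / r` follows the standard LoRA convention.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def camera_gate(focal_px, baseline_mm, w_f=1.0, w_b=1.0, bias=0.0,
                ref_focal_px=1000.0, ref_baseline_mm=4.0):
    """Hypothetical scalar gate in (0, 2) computed from camera parameters.

    With bias=0 the gate equals 1.0 at the (assumed) reference camera,
    so the LoRA update is applied at its nominal strength there.
    """
    z = (w_f * math.log(focal_px / ref_focal_px)
         + w_b * math.log(baseline_mm / ref_baseline_mm)
         + bias)
    return 2.0 / (1.0 + math.exp(-z))


def gated_lora_linear(x, W, A, B, gate, alpha=1.0):
    """Frozen linear layer plus a gated low-rank update.

    y = W x + gate * (alpha / r) * B (A x), where r is the LoRA rank.
    W stays frozen; only A, B (and the gate weights) would be trained.
    """
    r = len(A)                       # rank = number of rows of A
    base = matvec(W, x)              # frozen backbone path
    low_rank = matvec(B, matvec(A, x))
    scale = gate * alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy example: 2-D features, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity here)
A = [[1.0, 1.0]]               # r x d_in
B = [[1.0], [0.0]]             # d_out x r
x = [1.0, 2.0]

g_ref = camera_gate(1000.0, 4.0)               # 1.0 at the reference camera
print(gated_lora_linear(x, W, A, B, g_ref))    # [4.0, 2.0]
print(gated_lora_linear(x, W, A, B, 0.5))      # [2.5, 2.0]
```

Because the gate multiplies only the low-rank branch, the backbone weights remain untouched, which is consistent with the claim that the adaptation requires no backbone modifications.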

Key results

State-of-the-art depth accuracy on SCARED and Hamlyn datasets
Parameter-efficient adaptation requiring no backbone modifications
Robust handling of intraoperative camera intrinsics drift
Preservation of thin structures and sharp tissue boundaries in 3D reconstructions

Why it matters

Provides a reliable, computationally efficient depth estimation pipeline for surgical navigation, robotic control, and intraoperative 3D reconstruction.

Abstract

Stereo depth estimation has drawn widespread attention from the robotics community due to its broad applications, such as 3D reconstruction. Recently, stereo matching foundation models have made significant progress by training on large-scale datasets of natural images. However, directly applying these large pretrained models to minimally invasive surgery remains challenging due to domain shifts such as specular highlights and low-texture tissue. In this paper, we propose a parameter-efficient adaptation framework to address this gap. Specifically, we introduce Camera-Aware LoRA for fine-tuning FoundationStereo, using a camera-aware scaling gate computed from focal length and baseline to address intraoperative intrinsics drift arising from instrument self-heating and other thermal effects. We further develop a geometric consistency constraint and a spectral alignment regularizer that enforce cross-view depth agreement. Extensive experiments on the SCARED and Hamlyn datasets indicate that the proposed method achieves state-of-the-art performance. Notably, CaLoRA is easy to integrate into standard fine-tuning pipelines, requiring no backbone changes and only a small number of trainable parameters.
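
To make the role of focal length, baseline, and cross-view agreement concrete, the sketch below uses two textbook ingredients for rectified stereo: the depth relation Z = f * B / d and a left-right disparity consistency check on a single scanline. This is a generic illustration, not the paper's loss; the abstract does not give the exact form of its geometric constraint, and the function names and nearest-pixel lookup here are assumptions made for brevity.

```python
def disparity_to_depth(disp_px, focal_px, baseline_mm):
    """Depth (mm) of a point in a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_mm / disp_px


def lr_consistency_loss(disp_left, disp_right):
    """Mean absolute left-right disparity disagreement on one scanline.

    A left pixel at column x with disparity d matches the right pixel at
    column x - d; in a consistent pair, the right disparity there is also d.
    Nearest-pixel lookup is used here instead of interpolation.
    """
    total, count = 0.0, 0
    width = len(disp_right)
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))        # matching column in the right view
        if 0 <= xr < width:           # skip pixels that leave the right view
            total += abs(d - disp_right[xr])
            count += 1
    return total / count if count else 0.0


# Depth scales linearly with focal length, so an uncorrected 1% focal
# drift becomes a 1% depth error at every pixel.
print(disparity_to_depth(20.0, 1000.0, 4.0))   # 200.0
print(disparity_to_depth(20.0, 1010.0, 4.0))   # 202.0

# Consistent versus inconsistent disparity maps on a 6-pixel scanline.
print(lr_consistency_loss([2.0] * 6, [2.0] * 6))   # 0.0
print(lr_consistency_loss([2.0] * 6, [3.0] * 6))   # 1.0
```

The first two prints show why drift in the intrinsics matters for metric depth: the disparity is unchanged, yet the recovered depth shifts in proportion to the focal length.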

Index terms

Medical Robots and Systems; Computer Vision for Medical Robotics