ICRA 2026

SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

Kaiyuan Xu, Fangzhou Hong, Daniel Elson, Baoru Huang

AI summary

SurgCUT3R adapts a general 3D reconstruction model to surgical video by generating pseudo-ground-truth data, applying hybrid supervision, and using a hierarchical inference framework that mitigates long-term pose drift while keeping near state-of-the-art accuracy at substantially faster inference speed.

Surgical 3D reconstruction, Monocular depth estimation, Pose drift correction, Pseudo-ground truth, Hybrid supervision, Endoscopic video

Problem

Adapting state-of-the-art unified 3D reconstruction models to monocular surgical video is blocked by a severe lack of supervised training data and accumulated pose drift over long sequences.

Approach

SurgCUT3R generates metric-scale pseudo-ground-truth depth from public stereo datasets, trains the model with a hybrid supervised and geometric self-correction loss, and employs a dual-model hierarchical inference pipeline to correct long-term camera drift.
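The summary does not say how the stereo data is turned into depth. One standard way to get metric depth from a rectified stereo pair is triangulation, depth = focal length × baseline / disparity, and the sketch below shows only that relation. The function name, the focal length, and the baseline value are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Optional


def disparity_to_depth(
    disparity_px: List[List[float]],
    focal_px: float,
    baseline_m: float,
    min_disparity: float = 1e-6,
) -> List[List[Optional[float]]]:
    """Convert a rectified-stereo disparity map (pixels) to metric depth (metres).

    Pixels with no usable disparity are returned as None so that they can be
    masked out of any later supervision.
    """
    depth = []
    for row in disparity_px:
        depth.append([
            focal_px * baseline_m / d if d > min_disparity else None
            for d in row
        ])
    return depth


# Hypothetical values: 1000 px focal length, 5 mm stereo baseline.
disparity = [[50.0, 25.0], [0.0, 10.0]]
depth = disparity_to_depth(disparity, focal_px=1000.0, baseline_m=0.005)
```

Because the baseline is given in metres, the resulting depth is in metres as well, which is what makes stereo-derived labels metric rather than up-to-scale.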

Key results

Scalable pseudo-ground-truth depth generation from stereo surgical videos
Hybrid supervision strategy combining pseudo-GT with geometric self-correction
Hierarchical inference framework that suppresses accumulated pose drift
Near state-of-the-art accuracy with substantially faster inference on SCARED and StereoMIS datasets

Why it matters

Provides a practical, robust approach to 3D reconstruction of surgical scenes from monocular endoscopic video, a capability the authors describe as critical for advancing robotic-assisted surgery.

Abstract

Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: https://chumo-xu.github.io/SurgCUT3R-ICRA26/
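The abstract does not specify how the two models are combined. As a purely illustrative sketch of why a global/local split can limit drift, the code below chains local relative motions only within short segments and restarts each segment from a globally estimated keyframe pose. The 2D pose type, the function names, and all numbers are hypothetical and stand in for full 6-DoF camera poses; this is not the authors' pipeline.

```python
import math
from typing import List, Tuple

# (x, y, heading in radians): a 2D stand-in for a full camera pose.
Pose = Tuple[float, float, float]


def compose(a: Pose, b: Pose) -> Pose:
    """Apply relative motion b, expressed in the frame of pose a."""
    ax, ay, at = a
    bx, by, bt = b
    return (
        ax + math.cos(at) * bx - math.sin(at) * by,
        ay + math.sin(at) * bx + math.cos(at) * by,
        at + bt,
    )


def anchored_trajectory(
    global_keyframes: List[Pose],
    local_segments: List[List[Pose]],
) -> List[Pose]:
    """Chain local motions per segment, restarting at each global keyframe."""
    trajectory = []
    for anchor, segment in zip(global_keyframes, local_segments):
        pose = anchor  # discard error accumulated in the previous segment
        trajectory.append(pose)
        for rel in segment:
            pose = compose(pose, rel)
            trajectory.append(pose)
    return trajectory


# Hypothetical setup: true motion is 1.0 unit per frame along x, and the
# local estimates overshoot by 10%. Keyframes sit at frames 0 and 3.
keyframes = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
segments = [[(1.1, 0.0, 0.0)] * 2, [(1.1, 0.0, 0.0)] * 2]
trajectory = anchored_trajectory(keyframes, segments)
```

In this toy setup, chaining all five overshooting steps from frame 0 would place frame 5 at about x = 5.5, whereas re-anchoring at frame 3 places it at about x = 5.2 against a true value of 5.0, so the error is bounded by the segment length rather than the sequence length.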

Index terms

Computer Vision for Medical Robotics, Medical Robots and Systems, Computer Vision for Automation