ICRA 2026

ColonAdapter: Geometry Estimation through Foundation Model Adaptation for Colonoscopy

Zhiyi Jiang, Yifu Wang, Xuelian Cheng, Zongyuan Ge

AI summary

ColonAdapter adapts 3D geometric foundation models to colonoscopy via self-supervised fine-tuning, achieving state-of-the-art depth and pose estimation without ground-truth camera intrinsics.

3D geometry estimation, colonoscopy, foundation models, self-supervised learning, monocular depth, medical robotics

Problem

Monocular colonoscopy images carry no direct depth information and contain textureless regions, moving light sources, and non-Lambertian surfaces. Existing 3D geometric foundation models, trained mainly on natural scenes, therefore degrade in clinical scenes.

Approach

The method uses a self-supervised fine-tuning strategy to adapt 3D geometric foundation models for colonoscopy. It incorporates a Detail Restoration Module to recover fine details, a confidence-weighted photometric loss for training stability, and a geometry consistency loss to maintain scale coherence across frames.
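The two losses named above can be illustrated with a minimal NumPy sketch. The exact formulations used by ColonAdapter are not given in this summary, so both functions are assumptions: the photometric term uses a common confidence-aware form (per-pixel residual scaled by confidence, minus a log-confidence regulariser), and the geometry term uses a normalised depth-difference form often seen in scale-consistent self-supervised depth methods. The function names, the `alpha` weight, and the inputs (an already warped source image, an already projected depth map) are hypothetical.

```python
import numpy as np


def confidence_weighted_photometric_loss(target, warped, confidence, alpha=0.2):
    """Assumed form: mean over pixels of C * |target - warped| - alpha * log(C).

    target, warped: images of shape (H, W) or (H, W, 3), same shape.
    confidence:     positive per-pixel confidence map of shape (H, W).
    alpha:          hypothetical weight of the log-confidence regulariser.
    """
    residual = np.abs(target - warped)
    if residual.ndim == 3:
        residual = residual.mean(axis=-1)  # average over colour channels
    confidence = np.clip(confidence, 1e-6, None)  # keep log() finite
    return float(np.mean(confidence * residual - alpha * np.log(confidence)))


def geometry_consistency_loss(depth_projected, depth_sampled, eps=1e-6):
    """Assumed form: mean of |Da - Db| / (Da + Db) over pixels.

    depth_projected: depth of frame A's points expressed in frame B, shape (H, W).
    depth_sampled:   frame B's predicted depth sampled at the same pixels, shape (H, W).
    """
    diff = np.abs(depth_projected - depth_sampled)
    return float(np.mean(diff / (depth_projected + depth_sampled + eps)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((8, 8))
    warped = target + 0.05 * rng.standard_normal((8, 8))
    confidence = np.ones((8, 8))
    depth_a = 1.0 + rng.random((8, 8))
    depth_b = 1.3 * depth_a  # second frame predicted at a different scale

    print("photometric:", confidence_weighted_photometric_loss(target, warped, confidence))
    print("geometry:", geometry_consistency_loss(depth_a, depth_b))
```

In this sketch, the log term stops the confidence from collapsing to zero: lowering confidence on a pixel with a large residual (for example a specular highlight) reduces the first term but is penalised by the second. The geometry term is zero only when the two depth maps agree, so a per-frame scale mismatch such as the factor 1.3 above is penalised.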

Key results

State-of-the-art camera pose estimation on synthetic and real colonoscopy datasets
Accurate monocular depth prediction in low-texture and dynamic lighting conditions
High-fidelity dense 3D point map reconstruction without ground-truth intrinsics
More stable training through a confidence-weighted photometric loss, with a geometry consistency loss enforcing scale consistency

Why it matters

Enables reliable 3D spatial awareness for colonoscopy procedures, potentially improving clinical navigation, surgical planning, and AI-assisted diagnostics.

Abstract

Estimating 3D geometry from monocular colonoscopy images is challenging due to non-Lambertian surfaces, moving light sources, and large textureless regions. While recent 3D geometric foundation models eliminate the need for multi-stage pipelines, their performance deteriorates in clinical scenes. These models are primarily trained on natural scene datasets and struggle with specularity and homogeneous textures typical in colonoscopy, leading to inaccurate geometry estimation. In this paper, we present ColonAdapter, a self-supervised fine-tuning framework that adapts geometric foundation models for colonoscopy geometry estimation. Our method leverages pretrained geometric priors while tailoring them to clinical data. To improve performance in low-texture regions and ensure scale consistency, we introduce a Detail Restoration Module (DRM) and a geometry consistency loss. Furthermore, a confidence-weighted photometric loss enhances training stability in clinical environments. Experiments on both synthetic and real datasets demonstrate that our approach achieves state-of-the-art performance in camera pose estimation, monocular depth prediction, and dense 3D point map reconstruction, without requiring ground-truth intrinsic parameters.

Index terms

Deep Learning for Visual Perception; Computer Vision for Medical Robotics; Localization