ICRA 2026

3DME: Dual-Branch Encoder with Progressive Masking for 3D Medical Foundation Encoding Model

Hengyi Yuan, Zesheng Cheng, Huiru Chen, Wang Shixuan

AI summary

3DME achieves state-of-the-art performance across 12 segmentation and classification tasks while demonstrating strong zero-shot transfer capabilities for 3D medical imaging.
3D medical imaging, foundation model, dual-branch encoder, progressive masking, zero-shot transfer, surgical robotics

Problem

Adapting 2D foundation models to 3D medical data incurs prohibitive computational costs and fails to capture volumetric continuity, while existing 3D models lack cross-task generalization and rely heavily on task-specific annotations.

Approach

3DME employs a dual-branch encoder combining a Vision Transformer for global context and a 3D graph convolutional network for local voxel structures, trained via a progressive volumetric masking strategy that dynamically increases reconstruction difficulty to learn cross-slice continuity.
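
The summary does not give the actual masking schedule, so the following is only a minimal sketch of what a progressive volumetric masking curriculum could look like. It assumes a linear ramp of both the mask ratio and the cubic block size over training. The function names, the ratio range (0.3 to 0.75), and the block range (2 to 8 voxels) are hypothetical illustration values, not figures from the paper.

```python
import random


def masking_schedule(step, total_steps, ratio_range=(0.3, 0.75), block_range=(2, 8)):
    """Return (mask_ratio, block_size) for a training step.

    Assumption: both quantities grow linearly with training progress.
    The ranges are illustrative defaults, not values from the paper.
    """
    progress = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    ratio = ratio_range[0] + progress * (ratio_range[1] - ratio_range[0])
    block = round(block_range[0] + progress * (block_range[1] - block_range[0]))
    return ratio, block


def block_mask(shape, ratio, block, rng):
    """Build a boolean D x H x W mask by hiding whole cubes of side `block`.

    The volume is tiled into cubes, and about `ratio` of the cubes are
    chosen at random and fully masked (True means hidden).
    """
    depth, height, width = shape
    # Ceiling division gives the number of cubes along each axis.
    grid = (-(-depth // block), -(-height // block), -(-width // block))
    cells = [(i, j, k)
             for i in range(grid[0])
             for j in range(grid[1])
             for k in range(grid[2])]
    chosen = rng.sample(cells, round(ratio * len(cells)))

    mask = [[[False] * width for _ in range(height)] for _ in range(depth)]
    for i, j, k in chosen:
        for z in range(i * block, min((i + 1) * block, depth)):
            for y in range(j * block, min((j + 1) * block, height)):
                for x in range(k * block, min((k + 1) * block, width)):
                    mask[z][y][x] = True
    return mask


def masked_fraction(mask):
    """Fraction of voxels hidden by the mask."""
    flat = [voxel for plane in mask for row in plane for voxel in row]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 99):
        ratio, block = masking_schedule(step, total_steps=100)
        mask = block_mask((16, 16, 16), ratio, block, rng)
        print(step, round(ratio, 3), block, round(masked_fraction(mask), 3))
```

Under this assumed schedule, early steps hide small scattered cubes that can be filled in from nearby voxels, while later steps hide larger contiguous blocks that span several slices, which is one plausible way to increase reconstruction difficulty over time.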

Key results

  • State-of-the-art performance across 12 segmentation and classification tasks
  • Strong zero-shot transfer capability without fine-tuning
  • Superior BraTS and BTCV segmentation accuracy over existing baselines
  • Unified framework supporting lightweight downstream adaptation

Why it matters

It provides a scalable, generalizable foundation for 3D medical image analysis, accelerating clinical deployment and enhancing spatial perception for robotic-assisted surgery.

Abstract

Three-dimensional (3D) medical image analysis faces challenges such as massive data volume, difficulty in integrating cross-slice information, and limited model generalization. This paper proposes 3DME, a foundation model for 3D medical imaging. It has three core innovations: (1) a dual-branch 3D encoder that integrates a Vision Transformer for modeling global long-range dependencies and a 3D graph convolutional network for capturing local voxel structures, enhanced by multi-level deformable attention for cross-planar correlation; (2) a progressive volumetric masking strategy for self-supervised pretraining, which dynamically adjusts masking ratios and block sizes to force the model to learn cross-slice continuity and global semantics; and (3) a unified foundation model framework supporting lightweight adaptation for downstream tasks. Experiments demonstrate that 3DME achieves state-of-the-art (SOTA) performance on 12 segmentation and classification tasks and exhibits strong zero-shot transfer capabilities, thereby significantly enhancing model generalization and clinical deployment efficiency.

Index terms

Visual Learning, Deep Learning for Visual Perception, Computer Vision for Medical Robotics