3DME: Dual-Branch Encoder with Progressive Masking for 3D Medical Foundation Encoding Model
Hengyi Yuan, Zesheng Cheng, Huiru Chen, Wang Shixuan
AI summary
Problem
Adapting 2D foundation models to 3D medical data incurs prohibitive computational costs and fails to capture volumetric continuity, while existing 3D models lack cross-task generalization and rely heavily on task-specific annotations.
Approach
3DME employs a dual-branch encoder combining a Vision Transformer for global context and a 3D graph convolutional network for local voxel structures, trained via a progressive volumetric masking strategy that dynamically increases reconstruction difficulty to learn cross-slice continuity.
Key results
- State-of-the-art performance across 12 segmentation and classification tasks
- Strong zero-shot transfer capability without fine-tuning
- Superior BraTS and BTCV segmentation accuracy over existing baselines
- Unified framework supporting lightweight downstream adaptation
Why it matters
It provides a scalable, generalizable foundation for 3D medical image analysis, accelerating clinical deployment and enhancing spatial perception for robotic-assisted surgery.
Abstract
Three-dimensional (3D) medical image analysis faces challenges such as massive data volume, difficulty in integrating cross-slice information, and limited model generalization. This paper proposes 3DME, a foundation model for 3D medical imaging with three core innovations: (1) a dual-branch 3D encoder that integrates a Vision Transformer for modeling global long-range dependencies with a 3D graph convolutional network for capturing local voxel structures, enhanced by multi-level deformable attention for cross-planar correlation; (2) a progressive volumetric masking strategy for self-supervised pretraining, which dynamically adjusts masking ratios and block sizes to force the model to learn cross-slice continuity and global semantics; and (3) a unified foundation model framework supporting lightweight adaptation for downstream tasks. Experiments demonstrate that 3DME achieves state-of-the-art (SOTA) performance on 12 segmentation and classification tasks and exhibits strong zero-shot transfer capabilities, thereby significantly enhancing model generalization and clinical deployment efficiency.
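To make the progressive volumetric masking idea concrete, the following is a minimal sketch of how a masking ratio and block size could grow during pretraining. It is not the authors' implementation: the abstract states only that both quantities are adjusted dynamically, so the linear schedule, the ranges (0.3 to 0.75 for the ratio, 2 to 8 voxels for the block edge), and the helper names `masking_schedule` and `block_mask` are all assumptions chosen for illustration.

```python
import numpy as np


def masking_schedule(step, total_steps, ratio_range=(0.3, 0.75), block_range=(2, 8)):
    """Return (mask ratio, block edge length) for a training step.

    Assumption: both values grow linearly with training progress.
    """
    t = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    ratio = ratio_range[0] + t * (ratio_range[1] - ratio_range[0])
    block = int(round(block_range[0] + t * (block_range[1] - block_range[0])))
    return ratio, block


def block_mask(volume_shape, ratio, block, rng):
    """Return a boolean voxel mask (True = hidden) made of cubic blocks."""
    # Number of blocks along each axis, using ceiling division.
    grid = tuple(-(-s // block) for s in volume_shape)
    n_blocks = int(np.prod(grid))
    n_masked = int(round(ratio * n_blocks))
    flat = np.zeros(n_blocks, dtype=bool)
    flat[rng.choice(n_blocks, size=n_masked, replace=False)] = True
    coarse = flat.reshape(grid)
    # Expand each block decision to voxel resolution, then crop to the volume.
    fine = coarse.repeat(block, axis=0).repeat(block, axis=1).repeat(block, axis=2)
    d, h, w = volume_shape
    return fine[:d, :h, :w]


rng = np.random.default_rng(0)
volume = rng.standard_normal((32, 64, 64)).astype(np.float32)  # synthetic stand-in for a scan
for step in (0, 50, 99):
    ratio, block = masking_schedule(step, total_steps=100)
    mask = block_mask(volume.shape, ratio, block, rng)
    masked_volume = np.where(mask, 0.0, volume)  # encoder input; targets are volume[mask]
    print(step, round(ratio, 3), block, round(float(mask.mean()), 3))
```

Because each masked block spans several adjacent slices, larger blocks late in training remove the option of copying from a neighboring slice, which is one plausible way a schedule like this could encourage the cross-slice reasoning the abstract describes.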