SurgAM: Surgical Affordance Map Prediction with Multimodal Feature Fusion for Robot Autonomy
Lei Song, Yonghao Long, Mengya Xu, Jiayi Geng, Xiuyuan Chen, Qi Dou
AI summary
Problem
Surgical automation lacks a direct bridge between visual scene understanding and actionable robotic guidance, as existing perception methods fail to identify where and how a robot should safely interact with dynamic tissue.
Approach
The framework fuses semantic features from a vision transformer with spatial features from a diffusion model, using hierarchical prompt learning and scene-guided attention to generate actionable maps for specific surgical tasks.
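One way to picture the fusion step is a learned gate that decides, per location and channel, how much to trust each feature source. The sketch below is illustrative only and is not the authors' implementation: the function `gated_fusion`, the parameters `w_gate` and `b_gate`, and the feature shapes are all assumptions, and random arrays stand in for the real encoder outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(semantic, spatial, w_gate, b_gate):
    """Blend two (H, W, C) feature maps with a per-location, per-channel gate.

    Hypothetical sketch: the gate is computed from the concatenated features,
    so the mixing weights adapt to the content of each location.
    """
    stacked = np.concatenate([semantic, spatial], axis=-1)  # (H, W, 2C)
    gate = sigmoid(stacked @ w_gate + b_gate)               # (H, W, C), values in (0, 1)
    return gate * semantic + (1.0 - gate) * spatial         # convex combination

# Random stand-ins for the two encoders' outputs (assumed shapes).
rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
semantic = rng.normal(size=(H, W, C))   # e.g. vision transformer features
spatial = rng.normal(size=(H, W, C))    # e.g. generative model features
w_gate = rng.normal(scale=0.1, size=(2 * C, C))
b_gate = np.zeros(C)

fused = gated_fusion(semantic, spatial, w_gate, b_gate)
print(fused.shape)
```

Because the gate lies strictly between 0 and 1, each fused value falls between the two input values at that position, which makes the contribution of each source easy to inspect.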
Key results
- First comprehensive study on surgical affordance map prediction
- Novel adaptive multimodal feature fusion framework with hierarchical prompt learning
- New dataset with annotations for aspiration, clipping, and retraction
- State-of-the-art prediction accuracy and successful phantom-based autonomous manipulation
Why it matters
Advances surgical robotics by providing interpretable, actionable guidance that bridges perception and manipulation, crucial for developing safer autonomous surgical systems.
Abstract
Surgical automation is being increasingly studied, yet bridging visual scene understanding with autonomous action planning remains a fundamental challenge. While much research effort has been made on scene perception (e.g., tool recognition and scene segmentation), understanding and predicting actionable possibilities for surgical automation is still underexplored. In this paper, we introduce surgical affordance prediction, which identifies actionable regions for fundamental surgical actions from visual data. Specifically, a novel adaptive feature fusion framework is proposed that leverages the complementary strengths of a self-supervised vision transformer encoder for its superior semantic understanding and a large-scale generative model encoder for its spatially-aware capability. Furthermore, we introduce a hierarchical prompt learning mechanism to adapt to varying procedural contexts. Finally, a scene-guided attention decoder is proposed to focus on critical surgical areas while suppressing background distractions. To validate the effectiveness, we established a new dataset, derived from publicly available surgical datasets with affordance annotations for three basic surgical actions: aspiration, clipping, and retraction. Extensive experiments demonstrate that our approach achieves state-of-the-art performance. Moreover, we validate our framework's applicability for downstream automation on a realistic lung and prostate phantom, and results show that the predicted affordance maps successfully enable autonomous surgical actions.
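To make concrete how an affordance map could drive a downstream action, the following minimal sketch picks a target pixel from a heat map. It is an assumption for illustration, not the pipeline used in the paper: the function `select_target_pixel`, the `min_confidence` threshold, and the toy heat map are all hypothetical.

```python
import numpy as np

def select_target_pixel(affordance, min_confidence=0.5):
    """Return the (row, col) of the strongest affordance response.

    Hypothetical helper: returns None when even the peak is below
    min_confidence, so a caller can decline to act on a weak prediction.
    """
    peak = np.unravel_index(np.argmax(affordance), affordance.shape)
    if affordance[peak] < min_confidence:
        return None
    return int(peak[0]), int(peak[1])

# Toy 6x8 heat map with two candidate regions (made-up values).
heatmap = np.zeros((6, 8))
heatmap[2, 5] = 0.9   # strongest candidate
heatmap[4, 1] = 0.6   # weaker candidate

target = select_target_pixel(heatmap)
print(target)
```

In a real system the selected pixel would still need to be mapped into the robot's coordinate frame (for example via depth and camera calibration), which this sketch leaves out.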