ICRA 2026

SurgAM: Surgical Affordance Map Prediction with Multimodal Feature Fusion for Robot Autonomy

Lei Song, Yonghao Long, Mengya Xu, Jiayi Geng, Xiuyuan Chen, Qi Dou

AI summary

SurgAM predicts actionable surgical regions by fusing semantic and spatial features, achieving state-of-the-art affordance map prediction and enabling autonomous robotic manipulation validated on phantoms.

Surgical affordance, robotic surgery, multimodal fusion, autonomous manipulation, vision-language models

Problem

Surgical automation lacks a direct bridge between visual scene understanding and actionable robotic guidance, as existing perception methods fail to identify where and how a robot should safely interact with dynamic tissue.

Approach

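The framework adaptively fuses semantic features from a self-supervised vision transformer with spatial features from a diffusion model. Hierarchical prompt learning adapts the model to varying procedural contexts, and a scene-guided attention decoder focuses on critical surgical areas to generate affordance maps for specific surgical tasks.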
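
To make the idea of adaptive fusion concrete, here is a minimal sketch of one common way to combine two feature maps: a learned sigmoid gate that weights the semantic and spatial features at each location. This is an illustration only, not the paper's architecture; the function name `fuse_features` and the gate parameters `w_sem`, `w_spa`, and `bias` are hypothetical, and a real model would learn them and operate on tensors rather than Python lists.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_features(semantic, spatial, w_sem, w_spa, bias=0.0):
    """Gated fusion of two feature maps (illustrative assumption, not SurgAM itself).

    semantic, spatial: lists of feature vectors, one vector per image location,
                       with the same number of locations and channels.
    w_sem, w_spa:      hypothetical gate weights, one value per channel.
    bias:              hypothetical gate bias.
    Returns a list of fused feature vectors with the same shape as the inputs.
    """
    fused = []
    for sem_vec, spa_vec in zip(semantic, spatial):
        # The gate logit depends on both feature vectors at this location.
        logit = bias
        logit += sum(w * s for w, s in zip(w_sem, sem_vec))
        logit += sum(w * p for w, p in zip(w_spa, spa_vec))
        gate = sigmoid(logit)
        # gate near 1 favors semantic features; gate near 0 favors spatial ones.
        fused.append([gate * s + (1.0 - gate) * p for s, p in zip(sem_vec, spa_vec)])
    return fused


if __name__ == "__main__":
    semantic = [[1.0, 2.0]]
    spatial = [[3.0, 4.0]]
    # With all gate parameters at zero the gate is 0.5, so the result is the average.
    print(fuse_features(semantic, spatial, [0.0, 0.0], [0.0, 0.0]))
```

Because the gate is computed per location, the balance between the two encoders can differ across the image, which is the sense in which such a fusion is "adaptive".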

Key results

First comprehensive study on surgical affordance map prediction
Novel adaptive multimodal feature fusion framework with hierarchical prompt learning
New dataset with annotations for aspiration, clipping, and retraction
State-of-the-art prediction accuracy and successful phantom-based autonomous manipulation

Why it matters

Advances surgical robotics by providing interpretable, actionable guidance that bridges perception and manipulation, crucial for developing safer autonomous surgical systems.

Abstract

Surgical automation is being increasingly studied, yet bridging visual scene understanding with autonomous action planning remains a fundamental challenge. While much research effort has been devoted to scene perception (e.g., tool recognition and scene segmentation), understanding and predicting actionable possibilities for surgical automation is still underexplored. In this paper, we introduce surgical affordance prediction, which identifies actionable regions for fundamental surgical actions from visual data. Specifically, a novel adaptive feature fusion framework is proposed that leverages the complementary strengths of a self-supervised vision transformer encoder for its superior semantic understanding and a large-scale generative model encoder for its spatially-aware capability. Furthermore, we introduce a hierarchical prompt learning mechanism to adapt to varying procedural contexts. Finally, a scene-guided attention decoder is proposed to focus on critical surgical areas while suppressing background distractions. To validate its effectiveness, we established a new dataset, derived from publicly available surgical datasets, with affordance annotations for three basic surgical actions: aspiration, clipping, and retraction. Extensive experiments demonstrate that our approach achieves state-of-the-art performance. Moreover, we validate our framework’s applicability for downstream automation on realistic lung and prostate phantoms, and the results show that the predicted affordance maps successfully enable autonomous surgical actions.
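
The abstract does not say how a predicted affordance map is turned into a robot target, so the following is only a sketch of one simple possibility: threshold the map and take the score-weighted centroid of the remaining pixels as the interaction point. The function name `select_target` and the default threshold of 0.5 are assumptions made for illustration.

```python
def select_target(affordance, threshold=0.5):
    """Pick an interaction point from a 2D affordance map (illustrative assumption).

    affordance: list of rows, each a list of scores in [0, 1].
    threshold:  hypothetical cutoff; pixels scoring below it are ignored.
    Returns (row, col) as floats, or None if no pixel reaches the threshold.
    """
    total = 0.0
    row_acc = 0.0
    col_acc = 0.0
    for r, row in enumerate(affordance):
        for c, score in enumerate(row):
            if score >= threshold:
                # Accumulate a score-weighted sum of pixel coordinates.
                total += score
                row_acc += r * score
                col_acc += c * score
    if total == 0.0:
        return None
    return (row_acc / total, col_acc / total)


if __name__ == "__main__":
    heatmap = [
        [0.0, 0.1, 0.0],
        [0.1, 0.9, 0.1],
        [0.0, 0.1, 0.0],
    ]
    print(select_target(heatmap))
```

A centroid can fall outside the high-scoring region when that region is non-convex or split into several blobs, so a practical system would likely select per connected component or fall back to the peak pixel.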

Index terms

Surgical Robotics: Laparoscopy; AI-Enabled Robotics