SurgVidLM: Towards Multi-Grained Video Understanding with Large Language Model in Robot-Assisted Surgery
Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab and Hongliang Ren
AI summary
Problem
Current multimodal models lack the ability to perform fine-grained temporal reasoning on surgical videos, while existing datasets are limited in scale or accessibility, or lack detailed procedural annotations.
Approach
The authors propose SurgVidLM, which uses a two-stage StageFocus mechanism to first extract global surgical context and then perform high-frequency local analysis, enhanced by a Multi-frequency Fusion Attention module to blend coarse and fine visual tokens.
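The global-to-local sampling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the native frame rate, and the two sampling rates (`low_fps`, `high_fps`) are hypothetical values chosen only to show sparse sampling over the whole video followed by dense sampling inside a clip of interest.

```python
def sample_indices(start_s, end_s, fps_native, fps_sample):
    """Return frame indices in [start_s, end_s) seconds, sampled at fps_sample."""
    step = fps_native / fps_sample      # native frames between two samples
    first = int(start_s * fps_native)   # first frame of the window
    last = int(end_s * fps_native)      # exclusive upper bound
    indices = []
    k = 0
    while first + k * step < last:
        indices.append(int(first + k * step))
        k += 1
    return indices


def stage_focus_indices(duration_s, clip, fps_native=30.0, low_fps=0.5, high_fps=4.0):
    """Hypothetical two-stage sampler (all rates are assumptions, not paper values).

    Stage 1: sparse, low-frequency frames over the full video (global context).
    Stage 2: dense, high-frequency frames inside the clip (start_s, end_s).
    """
    global_idx = sample_indices(0.0, duration_s, fps_native, low_fps)
    local_idx = sample_indices(clip[0], clip[1], fps_native, high_fps)
    return global_idx, local_idx


# Usage: a 60 s video with a 10 s clip of interest starting at 10 s.
global_idx, local_idx = stage_focus_indices(60.0, (10.0, 20.0))
print(len(global_idx), len(local_idx))
```

With these assumed rates, the clip receives more frames than the entire video does in the first stage, which is the point of the second, high-frequency pass.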
Key results
- Constructed SVU-31K, a multi-grained surgical video dataset with over 31K video-instruction pairs
- Introduced the StageFocus mechanism for progressive global-to-local reasoning
- Designed Multi-frequency Fusion Attention to integrate low- and high-frequency visual tokens
- Outperformed state-of-the-art Vid-LLMs of comparable parameter scale on both full and fine-grained surgical video understanding tasks
Why it matters
It enables precise, context-aware analysis of complex surgical procedures, advancing surgical training and robotic decision-making systems.
Abstract
Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, helping surgeons understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train SurgVidLM, we construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset are publicly available at https://github.com/gkw0010/SurgVidLM.
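As a rough illustration of how low- and high-frequency visual tokens might be integrated, the sketch below uses plain scaled dot-product cross-attention with a residual connection. This is an assumption for illustration only: the abstract does not specify the internals of Multi-frequency Fusion Attention, so the choice of high-frequency tokens as queries, the absence of learned projections, and the name `fuse_tokens` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_tokens(high_tokens, low_tokens):
    """Hypothetical fusion: each high-frequency token attends over the
    low-frequency tokens, and the attended context is added back to it.
    No learned query/key/value projections are used (an assumption)."""
    dim = len(high_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in high_tokens:
        # Attention weights of this high-frequency token over all low-frequency tokens.
        weights = softmax([dot(q, k) * scale for k in low_tokens])
        # Weighted average of low-frequency tokens (global context).
        context = [sum(w * k[d] for w, k in zip(weights, low_tokens))
                   for d in range(dim)]
        # Residual connection keeps the fine-grained detail of the query token.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Usage: two high-frequency tokens, two identical low-frequency tokens.
high = [[1.0, 0.0], [0.0, 1.0]]
low = [[1.0, 0.0], [1.0, 0.0]]
fused = fuse_tokens(high, low)
print(fused)
```

The residual addition is one simple way to keep local detail while mixing in global context; a trained model would normally add learned projections and multiple heads.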