ICRA 2026

SurgVidLM: Towards Multi-Grained Video Understanding with Large Language Model in Robot-Assisted Surgery

Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab and Hongliang Ren

AI summary

SurgVidLM outperforms existing video-language models in surgical scene comprehension by combining a novel multi-grained dataset with a two-stage, multi-frequency reasoning architecture.

Surgical video understanding, Video-language models, Multi-grained reasoning, Robot-assisted surgery, SVU-31K dataset, StageFocus mechanism

Problem

Current multimodal models cannot perform fine-grained temporal reasoning on surgical videos, and existing datasets are limited in scale or accessibility, or lack detailed procedural annotations.

Approach

The authors propose SurgVidLM, which uses a two-stage StageFocus mechanism to first extract global surgical context and then perform high-frequency local analysis, enhanced by a Multi-frequency Fusion Attention module to blend coarse and fine visual tokens.
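As a rough illustration of the global-to-local idea described above, the following sketch picks frame indices in two passes: a sparse pass over the whole video, then a dense pass inside a time window of interest. It is not the authors' implementation. The function names, the sampling rates (0.5 fps and 5 fps), and the assumption that the window is supplied as start and end times in seconds are all hypothetical choices made for this example.

```python
def sample_indices(num_frames, native_fps, target_fps, start_s=0.0, end_s=None):
    """Return frame indices sampled at roughly target_fps within [start_s, end_s)."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    duration_s = num_frames / native_fps
    end_s = duration_s if end_s is None else min(end_s, duration_s)
    # Keep every `step`-th frame; never go below a step of 1.
    step = max(1, round(native_fps / target_fps))
    first = int(start_s * native_fps)
    last = min(num_frames, int(end_s * native_fps))
    return list(range(first, last, step))


def two_stage_indices(num_frames, native_fps, window, global_fps=0.5, local_fps=5.0):
    """Hypothetical two-pass sampling: sparse over the video, dense in `window`.

    `window` is a (start_s, end_s) pair; the default rates are illustrative only.
    """
    global_idx = sample_indices(num_frames, native_fps, global_fps)
    local_idx = sample_indices(num_frames, native_fps, local_fps, *window)
    return global_idx, local_idx


# Example: a 100 s clip at 30 fps, with a segment of interest from 40 s to 50 s.
global_idx, local_idx = two_stage_indices(3000, 30.0, (40.0, 50.0))
```

In this toy setting the sparse pass keeps one frame every two seconds across the clip, while the dense pass keeps five frames per second inside the ten-second window, so the local segment is covered at ten times the temporal resolution.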

Key results

Constructed SVU-31K, a 31K-pair multi-grained surgical video dataset
Introduced the StageFocus mechanism for progressive global-to-local reasoning
Designed Multi-frequency Fusion Attention to integrate low- and high-frequency visual tokens
Achieved state-of-the-art performance on full and fine-grained surgical video understanding benchmarks
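The summary does not specify how the Multi-frequency Fusion Attention module is built. One generic way to combine two token sets is cross-attention, sketched below in plain Python under several assumptions: high-frequency tokens act as queries, low-frequency tokens act as keys and values, there are no learned projections, and the attended context is added back to each query. The names `softmax` and `cross_attend` are hypothetical and the sketch should not be read as the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention with a residual connection.

    queries:     list of token vectors (here: high-frequency tokens).
    keys_values: list of token vectors used as both keys and values
                 (here: low-frequency tokens). No learned projections.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the value vectors, one dimension at a time.
        context = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Residual add keeps the original fine-grained token content.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


high_freq_tokens = [[1.0, 0.0], [0.0, 1.0]]
low_freq_tokens = [[1.0, 1.0]]
fused_tokens = cross_attend(high_freq_tokens, low_freq_tokens)
```

The residual connection is one simple way to keep fine-grained detail while mixing in coarse context; a trained model would normally add learned query, key, and value projections and multiple heads.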

Why it matters

It enables precise, context-aware analysis of complex surgical procedures, advancing surgical training and robotic decision-making systems.

Abstract

Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, facilitating surgeons to understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K that is a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop the Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset are publicly available at https://github.com/gkw0010/SurgVidLM.

Index terms

Semantic Scene Understanding, Computer Vision for Medical Robotics, Medical Robots and Systems