Surgical Video Understanding with Label Interpolation
Garam Kim, Tae Kyeong Jeong, Juyoun Park
AI summary
Problem
Multi-task learning for surgical videos suffers from a severe temporal-spatial annotation imbalance: labels for long-term tasks (phases and steps) are available for every frame, but short-term spatial annotations (instrument segmentation and action detection) are restricted to sparse key frames, leading to negative task interference and poor generalization.
Approach
The authors propose SurgMINT, a unified framework that uses optical flow to propagate segmentation labels from annotated key frames to adjacent unlabeled frames, enabling robust joint training of phase/step recognition, anticipation, and instrument/action detection.
Key results
- Proposes SurgMINT, a unified multi-task framework for surgical video understanding
- Introduces optical flow-based label interpolation to resolve temporal-spatial annotation imbalance
- Improves instrument detection mAP and step recognition accuracy through enriched spatial supervision
- Demonstrates that label interpolation mitigates negative task interference in joint training
Why it matters
Enables more accurate and efficient real-time surgical assistance by overcoming data scarcity in multi-task learning, directly benefiting robotic surgery developers and clinical researchers.
Abstract
Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions, which limits how comprehensively such approaches can understand them. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal–spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow–based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
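The label propagation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the flow estimator or the warping scheme, so everything below is an assumption made for illustration: the dense flow field is taken as already computed, the warp is a nearest-neighbour backward warp, and the function name `propagate_labels`, the flow layout, and the ignore label `255` are hypothetical choices rather than details of SurgMINT.

```python
import numpy as np

IGNORE_LABEL = 255  # hypothetical id for pixels that receive no supervision


def propagate_labels(key_mask, flow_to_key, ignore_label=IGNORE_LABEL):
    """Warp a key-frame segmentation mask onto a neighbouring unlabeled frame.

    key_mask:    (H, W) integer class ids annotated on the key frame.
    flow_to_key: (H, W, 2) float array (assumed layout). flow_to_key[y, x]
                 holds (dx, dy) such that pixel (x, y) of the unlabeled frame
                 corresponds to (x + dx, y + dy) in the key frame.
    """
    h, w = key_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Nearest-neighbour lookup keeps class ids discrete (no label blending).
    src_x = np.rint(xs + flow_to_key[..., 0]).astype(np.int64)
    src_y = np.rint(ys + flow_to_key[..., 1]).astype(np.int64)

    # Pixels whose source falls outside the key frame get the ignore label.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.full((h, w), ignore_label, dtype=key_mask.dtype)
    out[valid] = key_mask[src_y[valid], src_x[valid]]
    return out


# Toy key frame: a 2x2 "instrument" (class 1) on background (class 0).
key_mask = np.zeros((4, 6), dtype=np.uint8)
key_mask[1:3, 1:3] = 1

# Assume the scene shifted 2 pixels to the right in the neighbouring frame,
# so every pixel there points 2 pixels to the left in the key frame.
flow_to_key = np.zeros((4, 6, 2), dtype=np.float32)
flow_to_key[..., 0] = -2.0

propagated = propagate_labels(key_mask, flow_to_key)
print(propagated)
```

In this toy case the instrument region appears two columns to the right in the propagated mask, and the two leftmost columns, which have no counterpart in the key frame, are marked with the ignore label so that a segmentation loss could skip them. A real pipeline would also need to handle occlusions and flow errors, which this sketch does not attempt.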