ICRA 2026

Surgical Video Understanding with Label Interpolation

Garam Kim, Tae Kyeong Jeong, Juyoun Park

AI summary

Optical flow-based label interpolation effectively balances sparse spatial and dense temporal annotations, significantly improving multi-task surgical video understanding accuracy and efficiency.

Surgical video understanding, Multi-task learning, Label interpolation, Optical flow, Robot-assisted surgery, Step anticipation

Problem

Multi-task learning for surgical videos suffers from a severe temporal-spatial annotation imbalance, where long-term task labels are available per-frame but short-term spatial annotations are restricted to sparse key frames, leading to negative task interference and poor generalization.

Approach

The authors propose SurgMINT, a unified framework that uses optical flow to propagate segmentation labels from annotated key frames to adjacent unlabeled frames, enabling robust joint training of phase/step recognition, anticipation, and instrument/action detection.
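The propagation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `propagate_labels`, the flow convention (a dense field pointing from the unlabeled frame back to the key frame), the nearest-neighbor sampling, and the `ignore_label` value are all assumptions made for illustration. The flow field itself is taken as given; the source does not say which optical flow estimator is used.

```python
import numpy as np


def propagate_labels(key_mask, flow_to_key, ignore_label=255):
    """Warp a key-frame segmentation mask onto a neighbouring unlabeled frame.

    key_mask:    (H, W) integer class ids annotated on the key frame.
    flow_to_key: (H, W, 2) assumed flow from the unlabeled frame to the key
                 frame, with flow_to_key[y, x] = (dx, dy) in pixels.
    Pixels whose source falls outside the key frame get ignore_label.
    """
    h, w = key_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # For every pixel of the unlabeled frame, find where it came from in the
    # key frame. Rounding gives nearest-neighbour sampling, which keeps the
    # class ids discrete (no blending between labels).
    src_x = np.rint(xs + flow_to_key[..., 0]).astype(int)
    src_y = np.rint(ys + flow_to_key[..., 1]).astype(int)

    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.full((h, w), ignore_label, dtype=key_mask.dtype)
    warped[valid] = key_mask[src_y[valid], src_x[valid]]
    return warped


# Toy example: a 2x2 "instrument" region that moves 2 pixels to the right
# between the key frame and the neighbouring frame.
key = np.zeros((4, 6), dtype=np.uint8)
key[1:3, 1:3] = 1

flow = np.zeros((4, 6, 2), dtype=np.float32)
flow[..., 0] = -2.0  # each pixel in the new frame came from 2 px to the left

warped = propagate_labels(key, flow)
```

In the toy example the instrument region ends up at columns 3 to 4 of the neighbouring frame, and the two leftmost columns, which have no counterpart in the key frame, are marked with the ignore label so that a segmentation loss could skip them.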

Key results

Proposes SurgMINT, a unified multi-task framework for surgical video understanding
Introduces optical flow-based label interpolation to resolve temporal-spatial annotation imbalance
Improves instrument detection mAP and step recognition accuracy through enriched spatial supervision
Demonstrates that label interpolation mitigates negative task interference in joint training

Why it matters

Supports more accurate and efficient surgical scene understanding when spatial annotations are sparse, which in turn increases the utility of robot-assisted surgery for system developers and clinical researchers.

Abstract

Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal–spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow–based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.

Index terms

Medical Robots and Systems; Computer Vision for Medical Robotics