ICRA 2023

Rethinking Feature Extraction: Gradient-Based Localized Feature Extraction for End-To-End Surgical Downstream Tasks

Winnie Pang, Mobarakol Islam, Sai Mitheran Jagadesh Kumar, Lalithkumar Seenivasan, Mengya Xu, Hongliang Ren


Abstract

Several approaches have been introduced to understand surgical scenes through downstream tasks like captioning and surgical scene graph generation. However, most of them rely heavily on an independent object detector and region-based feature extractor. Because they encompass computationally expensive detection and feature extraction models, these multi-stage methods suffer from slow inference and inherit errors from the earlier stages, which limit real-time application and degrade performance, respectively. This work develops a detector-free, gradient-based localized feature extraction approach that enables end-to-end model training for downstream surgical tasks such as report generation and tool-tissue interaction graph prediction. We eliminate the need for object detection or region proposal and feature extraction networks by extracting the features of interest from the discriminative regions using gradient-based localization techniques (e.g., Grad-CAM). We show that our proposed approaches enable the real-time deployment of end-to-end models for surgical downstream tasks. We extensively validate our approach on two surgical tasks: captioning and scene graph generation. The results show that our gradient-based localized feature extraction methods effectively substitute for the detector and feature extractor networks, allowing end-to-end model development with faster inference speed, which is essential for real-time surgical scene understanding tasks. The code is publicly available at https://github.com/PangWinnie0219/GradCAMDownstreamTask.
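As a rough illustration of the gradient-based localization idea named in the abstract, the sketch below computes a Grad-CAM style heatmap from a convolutional feature map and its class-score gradients, then pools the feature map under that heatmap to obtain one localized feature vector. It is not the authors' implementation (see the repository above for that). The function names `grad_cam_heatmap` and `localized_feature`, the array shapes, and the heatmap-weighted pooling step are all assumptions made for illustration, and the activations and gradients are assumed to have been computed already by a classifier backbone.

```python
import numpy as np


def grad_cam_heatmap(activations, gradients):
    """Grad-CAM style heatmap (hypothetical helper, illustration only).

    activations: feature map of shape (C, H, W) from a convolutional layer.
    gradients:   gradient of a class score w.r.t. that feature map, same shape.
    Returns an (H, W) map scaled to [0, 1].
    """
    # Channel weights: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum of channels, then ReLU to keep positive evidence only.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def localized_feature(activations, heatmap):
    """Pool the feature map under the heatmap (an assumed pooling choice).

    Returns a (C,) vector: the heatmap-weighted spatial average of each channel.
    Falls back to plain average pooling if the heatmap is all zeros.
    """
    total = heatmap.sum()
    if total == 0:
        return activations.mean(axis=(1, 2))
    return (activations * heatmap).sum(axis=(1, 2)) / total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.random((8, 7, 7))             # stand-in for backbone activations
    grads = rng.standard_normal((8, 7, 7))   # stand-in for class-score gradients
    heat = grad_cam_heatmap(acts, grads)
    feat = localized_feature(acts, heat)
    print(heat.shape, feat.shape)
```

Under these assumptions, one such vector per predicted class would play the role that a detector's region feature plays in a multi-stage pipeline, without a separate detection or region proposal pass.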

Index terms

Surgical Robotics: Laparoscopy; Deep Learning for Visual Perception; Semantic Scene Understanding