Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Long Bai, Mobarakol Islam, Lalithkumar Seenivasan, Hongliang Ren
Abstract
Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still rely heavily on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic work and have limited time to answer. To address this, we develop a surgical question-answering system that facilitates robot-assisted surgical scene and activity understanding from recorded videos. Most existing visual question answering (VQA) methods require an object detector and a region-based feature extractor to extract visual features, which are then fused with the embedded text of the question to generate the answer. However, (i) surgical object detection models are scarce owing to small datasets and a lack of bounding-box annotations; (ii) current strategies for fusing heterogeneous modalities such as text and images are naive; and (iii) localized answering is missing, although it is crucial in complex surgical scenarios. In this paper, we propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during answer prediction. To handle the fusion of the heterogeneous modalities, we design a gated vision-language embedding (GVLE) that builds input patches for the Language Vision Transformer (LViT), which predicts the answer. To obtain the localization, we add a detection head in parallel with the prediction head of the LViT. We also integrate a generalized intersection over union (GIoU) loss to boost localization performance while preserving the accuracy of the question-answering model. We annotate two VQLA datasets using publicly available surgical videos from the MICCAI EndoVis-17 and EndoVis-18 challenges. Our validation results suggest that Surgical-VQLA can better understand the surgical scene and localize the specific area related to the question-answering.
GVLE offers an efficient language-vision embedding technique, showing superior performance over the existing benchmarks.

†L. Bai and M. Islam are co-first authors.
*The work was supported by Hong Kong Research Grants Council (RGC) Collaborative Research Fund (CRF C4026-21GF and CRF C4063-18G), and General Research Fund (GRF #14211420 and GRF #14216022); Shun Hing Institute of Advanced Engineering (BME-p1-21/8115064) at the CUHK; and Shenzhen-Hong Kong-Macau Technology Research Programme (Type C) Grant 202108233000303 awarded to Dr. H. Ren. M. Islam was funded by EPSRC grant [EP/W00805X/1]. We thank the CUHK Vice-Chancellor’s Ph.D. Scholarship Scheme for conference travel support. (Corresponding author: Hongliang Ren)
1 L. Bai and H. Ren are with the Dept. of Electronic Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, China. (E-mail: b.long@ieee.org)
2 M. Islam is with the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS), University College London, UK. (E-mail: mobarakol.islam@ucl.ac.uk)
3 L. Seenivasan and H. Ren are with the Dept. of Biomedical Engineering, National University of Singapore, Singapore. (E-mail: lalithkumar_s@u.nus.edu)
4 H. Ren is also with the Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong 999077, China. (E-mail: hlren@ieee.org)
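To illustrate the GIoU loss mentioned in the abstract, the following is a minimal Python sketch of the standard GIoU definition for two axis-aligned boxes. It is not the authors' implementation: the function names `giou` and `giou_loss`, the `(x1, y1, x2, y2)` box format, and the `1 - GIoU` form of the loss are assumptions made for illustration, and the abstract does not specify how this term is weighted against the answer-prediction loss.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2, so both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # Penalize the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def giou_loss(pred_box, target_box):
    """Loss in [0, 2]: 0 for identical boxes, larger for worse localization."""
    return 1.0 - giou(pred_box, target_box)
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU becomes more negative as the boxes move apart, so the loss still distinguishes near misses from distant ones.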