ICRA 2026

NeuroVLA: Surgical Scenario-Aware Learning of Debulking Skills in Endoscopic Robotic Neurosurgery Via Vision-Language-Action Model

Tat Ming Danny Chan, Hongbin Liu, Renzhi Wang, and Hongliang Ren

AI summary

NeuroVLA enables a surgical continuum robot to autonomously execute complex debulking tasks by conditioning a vision-language-action model on scenario-aware skill instructions.

Vision-Language-Action; Surgical Robotics; Debulking Task; Continuum Robot; Autonomous Surgery; Scenario-Aware Learning

Problem

Training VLA models for surgical robots is hindered by a lack of domain-specific data on deformable tissues and robot kinematics, causing poor scene understanding and imprecise control in confined neurosurgical environments.

Approach

NeuroVLA integrates endoscopic images, robot states, and textual skill objectives into a vision-language-action backbone to guide a parallel continuum robot through sequential align, grasp, transfer, and release phases.
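The sequencing described above can be illustrated with a short sketch. This is not the authors' implementation: the `Observation` fields, the instruction template, the callback names (`get_image`, `get_state`, `apply_action`, `skill_done`), and the step limit are all assumptions made for illustration. Only the three input modalities and the order of the four skills come from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Skill order as stated in the summary: align, grasp, transfer, release.
SKILLS = ("align", "grasp", "transfer", "release")


@dataclass
class Observation:
    """One policy input (hypothetical layout, not taken from the paper)."""
    image: Any                    # endoscopic frame
    robot_state: Sequence[float]  # continuum robot state
    instruction: str              # textual skill objective


# A policy maps an observation to an action vector (assumed interface).
Policy = Callable[[Observation], Sequence[float]]


def run_episode(
    policy: Policy,
    get_image: Callable[[], Any],
    get_state: Callable[[], Sequence[float]],
    apply_action: Callable[[Sequence[float]], None],
    skill_done: Callable[[str, Observation], bool],
    max_steps_per_skill: int = 50,  # assumed limit, not from the paper
) -> List[Tuple[str, Sequence[float]]]:
    """Run the four skills in order and return a log of (skill, action) pairs."""
    log: List[Tuple[str, Sequence[float]]] = []
    for skill in SKILLS:
        # Hypothetical instruction template; the real wording is not given here.
        instruction = f"{skill} the target tissue"
        for _ in range(max_steps_per_skill):
            obs = Observation(get_image(), get_state(), instruction)
            action = policy(obs)
            apply_action(action)
            log.append((skill, action))
            # Move on to the next skill once the current one is reported done.
            if skill_done(skill, obs):
                break
    return log
```

In this sketch the language instruction changes once per skill, so a single policy is conditioned on the current phase instead of relying on four separate controllers.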

Key results

Reduced pixel distance error by at least 55% for align and transfer skills
Achieved mean pixel distances of 29.10 and 21.55 pixels for align and transfer phases
Attained 88.89% and 100% success rates for grasp and release skills
Published the AutoDebulk dataset of 90 debulking episodes for continuum robot training
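
The metrics above can be made concrete with a minimal sketch. It assumes that pixel distance is the Euclidean distance between paired image points and that success rate is the percentage of successful trials; the exact definitions, the baseline values, and the trial counts are not given here. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def mean_pixel_distance(predicted: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance, in pixels, between paired (x, y) image points."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of trials marked successful."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return 100.0 * sum(outcomes) / len(outcomes)


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error
```

As an illustration only, `success_rate([True] * 8 + [False])` rounds to 88.89, which shows how a figure of that form can arise from a small number of trials; the actual trial counts are not stated in this summary.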

Why it matters

Demonstrates, in phantom-based experiments, the viability of scenario-aware VLA models for precise, autonomous control of continuum surgical robots, a step toward safer neurosurgical automation.

Abstract

Robotic surgical systems have attracted widespread attention due to their accuracy and efficiency during operations. Recent studies have shown that Vision-Language-Action (VLA) models offer greater potential to enable autonomous task completion in complex environments. However, the application of VLA models in surgical robotics is often limited by insufficient data on surgical environments and robot kinematics. As a result, models trained with limited data often lack a comprehensive understanding of the surgical scene and the robot’s behavior. In this paper, we propose NeuroVLA, a VLA model designed for the debulking task in neurosurgical robotic scenarios. We collected a dataset using a flexible parallel continuum robot in phantom-based debulking experiments. We formulate the skill objectives of the debulking task as skill instructions in NeuroVLA. Within NeuroVLA, we develop a scenario-understanding module built on a Vision-Language Model backbone, which uses skill-based instructions to help the robot understand both the surgical debulking scenario and the robot itself. After training on 90 debulking episodes, NeuroVLA can infer the corresponding actions from image observations, language instructions, and robot states for the four sequential skills of the debulking task: align, grasp, transfer, and release. We evaluate NeuroVLA on each of these four skills. Our approach reduces pixel distance error by at least 55% and achieves mean pixel distances of 29.10 and 21.55 pixels in the align and transfer skills, respectively. The success rates for the grasp and release skills are 88.89% and 100%, respectively.

Index terms

Medical Robots and Systems; Learning from Demonstration