NeuroVLA: Surgical Scenario-Aware Learning of Debulking Skills in Endoscopic Robotic Neurosurgery Via Vision-Language-Action Model
Tat Ming Danny Chan, Hongbin Liu, Renzhi Wang, and Hongliang Ren
AI summary
Problem
Training VLA models for surgical robots is hindered by a lack of domain-specific data on deformable tissues and robot kinematics, causing poor scene understanding and imprecise control in confined neurosurgical environments.
Approach
NeuroVLA integrates endoscopic images, robot states, and textual skill objectives into a vision-language-action backbone to guide a parallel continuum robot through sequential align, grasp, transfer, and release phases.
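The sequential skill structure described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `PolicyInput` fields, the `instruction_for` wording, the `policy` and `is_done` callables, and the `max_steps` limit are all hypothetical names and values introduced here for clarity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Skill(Enum):
    """The four sequential debulking skills named in the text."""
    ALIGN = "align"
    GRASP = "grasp"
    TRANSFER = "transfer"
    RELEASE = "release"


@dataclass
class PolicyInput:
    """One model input; field layouts are assumptions for illustration."""
    image: Sequence[Sequence[float]]  # endoscopic frame (placeholder type)
    robot_state: Sequence[float]      # robot state vector (layout assumed)
    instruction: str                  # textual skill objective


def instruction_for(skill: Skill) -> str:
    # Assumed wording; the actual skill instructions are not given here.
    return f"Perform the {skill.value} skill."


def run_episode(
    policy: Callable[[PolicyInput], List[float]],
    get_image: Callable[[], Sequence[Sequence[float]]],
    get_state: Callable[[], Sequence[float]],
    is_done: Callable[[Skill, List[float]], bool],
    max_steps: int = 50,
) -> List[Tuple[Skill, List[float]]]:
    """Run the skills in order, querying the policy until each one finishes."""
    log: List[Tuple[Skill, List[float]]] = []
    for skill in Skill:  # Enum iteration follows definition order
        for _ in range(max_steps):
            model_input = PolicyInput(get_image(), get_state(), instruction_for(skill))
            action = policy(model_input)
            log.append((skill, action))
            if is_done(skill, action):
                break
    return log


# Usage with stand-in components (no real model or robot involved).
demo_log = run_episode(
    policy=lambda x: [0.0, 0.0, 0.0],
    get_image=lambda: [[0.0]],
    get_state=lambda: [0.0, 0.0],
    is_done=lambda skill, action: True,
)
```

Passing the skill instruction alongside the image and robot state at every step is what lets one policy serve all four phases in this sketch; how NeuroVLA decides that a skill has finished is not stated in this summary.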
Key results
- Reduced pixel distance error by at least 55% for align and transfer skills
- Achieved mean pixel distances of 29.10 and 21.55 pixels for the align and transfer skills, respectively
- Attained success rates of 88.89% and 100% for the grasp and release skills, respectively
- Published the AutoDebulk dataset of 90 debulking episodes for continuum robot training
Why it matters
Demonstrates the viability of scenario-aware VLA models for precise, autonomous control of continuum surgical robots in phantom-based experiments, a step toward safer neurosurgical automation.
Abstract
Robotic surgical systems have attracted widespread attention for their accuracy and efficiency during operations. Recent studies suggest that Vision-Language-Action (VLA) models offer strong potential for autonomous task completion in complex environments. However, applying VLA models to surgical robotics is often limited by insufficient data on surgical environments and robot kinematics, so models trained on limited data often lack a comprehensive understanding of the surgical scene and the robot’s behavior. In this paper, we propose NeuroVLA, a VLA model designed for the debulking task in neurosurgical robotic scenarios. We collected a dataset using a flexible parallel continuum robot in phantom-based debulking experiments, and we formulate the skill objectives of the debulking task as skill instructions in NeuroVLA. We develop scenario understanding within NeuroVLA on a Vision-Language Model (VLM) backbone to help the robot understand both the surgical debulking scenario and the robot itself through skill-based instructions. After training on 90 debulking episodes, NeuroVLA infers the corresponding actions from image observations, language instructions, and robot states for the four sequential skills of the debulking task: align, grasp, transfer, and release. We evaluate NeuroVLA on each of these skills. Our approach reduces pixel distance error by at least 55% and achieves mean pixel distances of 29.10 and 21.55 pixels for the align and transfer skills, respectively. The success rates for the grasp and release skills are 88.89% and 100%, respectively.
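The reported quantities (mean pixel distance, relative error reduction, success rate) can be illustrated with a short sketch. The definitions below are assumptions: the abstract does not specify how pixel distance is measured, so it is taken here as the Euclidean distance between an achieved point and a target point in image coordinates, and the reduction is taken relative to an unspecified baseline. All function names and sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pixel_distance(achieved: Point, target: Point) -> float:
    """Euclidean distance in pixels (assumed definition of the metric)."""
    return math.hypot(achieved[0] - target[0], achieved[1] - target[1])


def mean_pixel_distance(achieved: Sequence[Point], targets: Sequence[Point]) -> float:
    """Average pixel distance over paired trials."""
    if len(achieved) != len(targets) or not achieved:
        raise ValueError("need equally sized, non-empty point lists")
    return sum(pixel_distance(a, t) for a, t in zip(achieved, targets)) / len(achieved)


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Percentage reduction of new_error relative to baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of successful trials."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative values only; these are not data from the paper.
example_mean = mean_pixel_distance([(0.0, 0.0), (0.0, 0.0)], [(3.0, 4.0), (6.0, 8.0)])
example_reduction = relative_reduction(100.0, 45.0)
example_success = success_rate([True, True, False, True])
```

Under these assumed definitions, a reduction of at least 55% means the new mean pixel distance is at most 45% of the baseline value.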