ICRA 2026

A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

Yuelin Zhang, Qingpeng Ding, Longxiang Tang, Chengyu Fang, Shing Shin Cheng

AI summary

A unified vision-language-action framework enables real-time, adaptive ultrasound-guided needle insertion that outperforms state-of-the-art trackers and manual operation in accuracy, success rate, and speed.

Vision-Language-Action, Ultrasound-Guided Needle Insertion, Robotic Ultrasound, Real-Time Tracking, Uncertainty-Aware Control, Medical Robotics

Problem

Ultrasound-guided needle insertion is hindered by dynamic imaging conditions, occlusion, and intermittent needle invisibility, causing traditional hand-crafted automated pipelines to fail or degrade. There is a lack of generalizable, context-aware systems that can unify real-time tracking with adaptive control.

Approach

The framework attaches a dedicated Cross-Depth Fusion (CDF) tracking head and a lightweight Tracking-Conditioning (TraCon) register to a pretrained vision backbone. An uncertainty-aware control policy and an asynchronous pipeline then use the tracking output to drive real-time, adaptive needle insertion.
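One way to picture the asynchronous pipeline is a fast tracking loop that never waits for a slower decision step. The sketch below is only an illustration under stated assumptions: the names `LatestSlot`, `run_async`, `track_dt`, and `policy_dt` are hypothetical, the sleeps stand in for real tracking and model inference, and the "advance only when the tip is visible" rule is a placeholder rather than the paper's policy.

```python
import threading
import time


class LatestSlot:
    """Thread-safe slot that keeps only the most recent value (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def put(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def run_async(num_frames=20, track_dt=0.005, policy_dt=0.02):
    """Run a fast tracking loop alongside a slower policy thread."""
    tip_slot, cmd_slot = LatestSlot(), LatestSlot()
    stop = threading.Event()
    commands = []  # written only by the policy thread, read after join()

    def slow_policy():
        while True:
            finished = stop.is_set()  # checked first so one final pass always runs
            tip = tip_slot.get()
            if tip is not None:
                time.sleep(policy_dt)  # stand-in for slow model inference
                cmd = {"from_frame": tip["frame"], "advance": tip["visible"]}
                cmd_slot.put(cmd)
                commands.append(cmd)
            else:
                time.sleep(0.001)  # nothing tracked yet; poll again shortly
            if finished:
                break

    worker = threading.Thread(target=slow_policy)
    worker.start()

    applied = []
    for frame in range(num_frames):
        time.sleep(track_dt)  # stand-in for per-frame tracking
        # Assumed toy visibility pattern: the tip vanishes every fifth frame.
        tip_slot.put({"frame": frame, "visible": frame % 5 != 0})
        # The fast loop uses whatever command is newest, possibly a stale one.
        applied.append(cmd_slot.get())

    stop.set()
    worker.join()
    return applied, commands


if __name__ == "__main__":
    applied, commands = run_async()
    print(len(applied), "frames tracked,", len(commands), "policy decisions")
```

The design choice illustrated here is that each side reads only the newest value from the other, so a slow decision step delays how fresh a command is but never stalls tracking.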

Key results

Cross-Depth Fusion tracking head enables end-to-end real-time needle tracking at ~25 FPS
Tracking-Conditioning register provides parameter-efficient adaptation of frozen vision backbones
Uncertainty-aware control policy dynamically adjusts insertion speed based on tip visibility
Consistently outperforms state-of-the-art trackers and manual operation in accuracy, success rate, and procedure time
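The idea of adjusting insertion speed to tip visibility can be sketched as a simple confidence-to-speed mapping. This is a minimal illustration, not the paper's control policy: the function name `insertion_speed`, the linear ramp, the threshold `stop_below`, and the speed limits in mm/s are all assumptions chosen for the example.

```python
def insertion_speed(tip_confidence, v_max=2.0, v_min=0.0, stop_below=0.2):
    """Map a tip-visibility confidence in [0, 1] to an insertion speed.

    All parameters are hypothetical: below `stop_below` the needle is held
    at `v_min`; above it, speed ramps linearly up to `v_max`.
    """
    if not 0.0 <= tip_confidence <= 1.0:
        raise ValueError("tip_confidence must lie in [0, 1]")
    if tip_confidence < stop_below:
        return v_min  # tip effectively invisible: hold or creep
    scale = (tip_confidence - stop_below) / (1.0 - stop_below)
    return v_min + scale * (v_max - v_min)


if __name__ == "__main__":
    for conf in (0.1, 0.4, 0.7, 1.0):
        print(f"confidence {conf:.1f} -> speed {insertion_speed(conf):.2f} mm/s")
```

A monotone mapping like this slows the robot as tracking becomes less certain, which is the behavior the summary describes; the actual policy may use a different uncertainty measure and response curve.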

Why it matters

Enables safer, more standardized, and operator-independent robotic ultrasound interventions, particularly beneficial for complex percutaneous procedures and resource-limited clinical settings.

Abstract

Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the large-scale vision backbone. To adapt the pretrained vision backbone for tracking tasks, a Tracking-Conditioning (TraCon) register is introduced for parameter-efficient feature conditioning. After needle tracking, an uncertainty-aware control policy and an asynchronous VLA pipeline are presented for adaptive needle insertion control, ensuring timely decision-making for improved safety and outcomes. Extensive experiments on both needle tracking and insertion show that our method consistently outperforms state-of-the-art trackers and manual operation, achieving higher tracking accuracy, improved insertion success rates, and reduced procedure time, highlighting promising directions for RUS-based intelligent intervention.
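To give a feel for why combining shallow positional and deep semantic features can help, here is a toy sketch. It is not the CDF head itself, whose architecture is not described on this page: the nearest-neighbour upsampling, the element-wise product used for fusion, and the names `upsample_nearest`, `argmax2d`, and `fuse_and_locate` are all assumptions made for illustration, and the grids are invented numbers.

```python
def upsample_nearest(grid, factor):
    """Repeat each cell of a 2D grid `factor` times along both axes."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[r // factor][c // factor] for c in range(cols * factor)]
            for r in range(rows * factor)]


def argmax2d(grid):
    """Return the (row, col) index of the largest value in a 2D grid."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return max(cells, key=lambda rc: grid[rc[0]][rc[1]])


def fuse_and_locate(shallow, deep):
    """Gate a fine positional map with a coarse semantic map, then pick the peak.

    The element-wise product is an assumed stand-in for learned fusion.
    """
    factor = len(shallow) // len(deep)
    deep_up = upsample_nearest(deep, factor)
    fused = [[s * d for s, d in zip(s_row, d_row)]
             for s_row, d_row in zip(shallow, deep_up)]
    return argmax2d(fused)


# Invented 4x4 shallow map: sharp responses at a bright artifact (0, 0)
# and at the true tip (2, 3).
shallow = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.8],
    [0.0, 0.0, 0.0, 0.1],
]
# Invented 2x2 deep map: coarse but confident that the needle is bottom-right.
deep = [
    [0.1, 0.1],
    [0.1, 0.9],
]

if __name__ == "__main__":
    print("shallow only:", argmax2d(shallow))
    print("fused:", fuse_and_locate(shallow, deep))
```

In this toy case the shallow map alone picks the artifact, while the fused map picks the tip: the deep features say roughly where the needle is, and the shallow features say precisely where within that region.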

Index terms

Medical Robots and Systems, Visual Servoing, Visual Tracking