ICRA 2026

Stream-To-Act: ROS 2 Native Token Streaming for Continuous Motion Execution of Vision-Language-Action Models

Dahyun Kim, Yunseong Jeon, Hongkyun Park, Jong-Chan Kim

AI summary

Eliminating HTTP/JSON serialization overhead and committing partial VLM plans early enables stall-free, continuous robotic control on embedded hardware.

ROS 2; Vision-Language Models; Streaming Inference; Real-time Robotics; Latency Reduction; Continuous Control

Problem

Existing VLM serving frameworks rely on HTTP/JSON pipelines with Base64 encoding, causing high latency and control stalls during streaming, while naive partial-plan execution leads to stop-and-go behavior in robotic control loops.
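To make the serialization cost concrete, the following is a minimal sketch of how much a JSON message with a Base64-encoded image grows relative to the raw bytes. The frame size, the `image` field name, and the helper `base64_json_overhead` are illustrative assumptions, not details from the paper; the sketch measures only message size, not the encode/decode time that also contributes to latency.

```python
import base64
import json


def base64_json_overhead(payload: bytes) -> float:
    """Return the size ratio of a JSON+Base64 message to the raw payload."""
    # Base64 maps every 3 input bytes to 4 ASCII characters.
    encoded = base64.b64encode(payload).decode("ascii")
    # Hypothetical message layout: a single "image" field.
    message = json.dumps({"image": encoded})
    return len(message.encode("utf-8")) / len(payload)


# Hypothetical uncompressed 640x480 RGB frame (all zeros).
raw_frame = bytes(640 * 480 * 3)
ratio = base64_json_overhead(raw_frame)

# The receiver must also parse the JSON and decode the Base64 text
# before it can use the image, which is a second full pass over the data.
decoded = base64.b64decode(
    json.loads(json.dumps({"image": base64.b64encode(raw_frame).decode("ascii")}))["image"]
)

print(f"message is {ratio:.3f}x the raw payload size")
```

Under these assumptions the message is roughly one third larger than the raw frame, and both sender and receiver pay an extra copy per request. A binary ROS 2 message avoids that text encoding step.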

Approach

We extend llama-ros with a ROS 2-native streaming pipeline that removes Base64/JSON overhead and introduces a Stream-to-Act mechanism that commits partial plans immediately using an optimal start-time policy to ensure continuous execution.
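One way to read the start-time idea is as a scheduling constraint: if action i is executed at time s + i·d, it must already have arrived by then. The sketch below computes the earliest such s and compares it with starting as soon as the first action arrives. It assumes a fixed per-action execution time and known arrival times; the function names and the example numbers are hypothetical, and this is not claimed to be the paper's formula or policy.

```python
from typing import Sequence, Tuple


def earliest_stall_free_start(arrivals_ms: Sequence[int], step_ms: int) -> int:
    """Earliest start s such that action i, run at s + i*step_ms, has already arrived."""
    if not arrivals_ms:
        raise ValueError("need at least one action arrival time")
    # Constraint for every i: s + i*step_ms >= arrivals_ms[i].
    return max(a - i * step_ms for i, a in enumerate(arrivals_ms))


def simulate(arrivals_ms: Sequence[int], step_ms: int, start_ms: int) -> Tuple[int, int]:
    """Execute actions in order; return (number of stalls, finish time in ms)."""
    t = start_ms
    stalls = 0
    for arrival in arrivals_ms:
        if arrival > t:
            # The controller has nothing to execute and must wait.
            stalls += 1
            t = arrival
        t += step_ms
    return stalls, t


# Hypothetical arrival times (ms) of five decoded actions, each taking 100 ms to execute.
arrivals = [100, 200, 350, 400, 600]
step = 100

naive = simulate(arrivals, step, start_ms=arrivals[0])
start = earliest_stall_free_start(arrivals, step)
planned = simulate(arrivals, step, start_ms=start)

print("naive   (stalls, finish):", naive)
print("planned (stalls, finish):", planned, "start:", start)
```

In this example both schedules finish at the same time, but starting immediately stalls twice mid-plan while the delayed start runs without interruption. That matches the qualitative claim that continuity can improve even when total generation time is unchanged.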

Key results

Reduced end-to-end latency from ~102 ms to ~15.5 ms on Jetson AGX Orin
Eliminated stop-and-go behavior in Carla driving scenarios via optimal partial-plan commitment
Quantified token generation throughput across RTX and Jetson platforms
Derived a stall-free boundary formula for naive vs. Stream-to-Act execution

Why it matters

Enables reliable, real-time integration of large vision-language models into resource-constrained robotic systems without sacrificing control smoothness or safety.

Abstract

Vision-Language Models (VLMs) are increasingly used in robotics for natural language understanding and executable plan generation, yet integrating them into real-time control pipelines remains challenging. Many existing systems rely on HTTP/JSON-based inference interfaces that require repeated Base64 serialization, introducing unnecessary overhead and increasing end-to-end latency. At the execution level, waiting for a full plan leads to stalls where no valid actions are available, while naive streaming of partial plans produces stop-and-go behavior due to token arrival gaps. To address these issues, we extend llama-ros with Stream-to-Act, a ROS 2-native execution mechanism that begins acting once sufficient tokens arrive while ensuring continuous execution through an optimal start-time policy. Our open-source implementation is evaluated on RTX GPUs and NVIDIA Jetson platforms through end-to-end latency analysis, token generation throughput measurements, and execution timeline visualization. In addition, a Carla-based driving scenario illustrates how the proposed execution policy eliminates stop-and-go behavior and maintains continuous control, even when the total plan generation time remains unchanged.

Index terms

Software Architecture for Robotic and Automation; Software Tools for Robot Programming; Software Middleware and Programming Environments