Stream-To-Act: ROS 2 Native Token Streaming for Continuous Motion Execution of Vision-Language-Action Models
Dahyun Kim, Yunseong Jeon, Hongkyun Park, Jong-Chan Kim
AI summary
Problem
Existing VLM serving frameworks rely on HTTP/JSON pipelines with Base64 encoding, causing high latency and control stalls during streaming, while naive partial-plan execution leads to stop-and-go behavior in robotic control loops.
Approach
We extend llama-ros with a ROS 2-native streaming pipeline that removes Base64/JSON overhead and introduces a Stream-to-Act mechanism that commits partial plans immediately using an optimal start-time policy to ensure continuous execution.
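To give a rough sense of the serialization cost mentioned above, the following minimal sketch compares a raw image buffer with its Base64 text form. The 640x480 RGB frame size and the helper name `base64_size` are illustrative assumptions, not values or code from the paper; only the 4/3 size growth of Base64 is a general property of the encoding.

```python
import base64
import math


def base64_size(n_bytes: int) -> int:
    """Length of the padded Base64 text for n_bytes of binary input."""
    return 4 * math.ceil(n_bytes / 3)


# Hypothetical uncompressed 640x480 RGB frame (assumption, not from the paper).
raw = bytes(640 * 480 * 3)

# An HTTP/JSON interface has to carry the image as text, e.g. Base64.
encoded = base64.b64encode(raw)

# A binary transport can carry the buffer as-is, with no re-encoding step.
overhead = len(encoded) / len(raw)

print(f"raw bytes:     {len(raw)}")
print(f"base64 bytes:  {len(encoded)}")
print(f"size overhead: {overhead:.3f}x")
```

Beyond the roughly 33% larger payload, every request also pays for an encode on the sender and a decode on the receiver, which is the repeated work a native binary message path avoids.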
Key results
- Reduced end-to-end latency from ~102 ms to ~15.5 ms on Jetson AGX Orin
- Eliminated stop-and-go behavior in Carla driving scenarios via optimal partial-plan commitment
- Quantified token generation throughput across RTX and Jetson platforms
- Derived a stall-free boundary formula for naive vs. Stream-to-Act execution
Why it matters
Enables reliable, real-time integration of large vision-language models into resource-constrained robotic systems without sacrificing control smoothness or safety.
Abstract
Vision-Language Models (VLMs) are increasingly used in robotics for natural language understanding and executable plan generation, yet integrating them into real-time control pipelines remains challenging. Many existing systems rely on HTTP/JSON-based inference interfaces that require repeated Base64 serialization, introducing unnecessary overhead and increasing end-to-end latency. At the execution level, waiting for a full plan leads to stalls where no valid actions are available, while naive streaming of partial plans produces stop-and-go behavior due to token arrival gaps. To address these issues, we extend llama-ros with Stream-to-Act, a ROS 2-native execution mechanism that begins acting once sufficient tokens arrive while ensuring continuous execution through an optimal start-time policy. Our open-source implementation is evaluated on RTX GPUs and NVIDIA Jetson platforms through end-to-end latency analysis, token generation throughput measurements, and execution timeline visualization. In addition, a Carla-based driving scenario illustrates how the proposed execution policy eliminates stop-and-go behavior and maintains continuous control, even when the total plan generation time remains unchanged.
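The difference between naive partial-plan execution and a delayed, stall-free start can be illustrated with a small scheduling sketch. This is not the paper's formulation: it assumes each plan step takes a fixed execution time `step_ms`, that step `i` becomes executable at `arrivals_ms[i]`, and that all arrival times are known for the analysis. Under those assumptions, the earliest start time `s` with no stalls must satisfy `s + i * step_ms >= arrivals_ms[i]` for every step, so `s = max_i(arrivals_ms[i] - i * step_ms)`. All function names and numbers below are hypothetical.

```python
from typing import List, Tuple


def naive_schedule(arrivals_ms: List[int], step_ms: int) -> Tuple[List[int], int]:
    """Run each step as soon as it has arrived and the previous step is done.

    Returns the start time of every step and the total idle (stall) time
    spent between consecutive steps.
    """
    starts: List[int] = []
    stall = 0
    prev_end = None
    for arrival in arrivals_ms:
        if prev_end is None:
            start = arrival  # first step starts when it arrives
        else:
            start = max(prev_end, arrival)
            stall += start - prev_end  # gap with no action available
        starts.append(start)
        prev_end = start + step_ms
    return starts, stall


def stall_free_start(arrivals_ms: List[int], step_ms: int) -> int:
    """Earliest start time such that every step has arrived when it is due."""
    return max(a - i * step_ms for i, a in enumerate(arrivals_ms))


def delayed_schedule(arrivals_ms: List[int], step_ms: int) -> List[int]:
    """Back-to-back execution beginning at the stall-free start time."""
    s = stall_free_start(arrivals_ms, step_ms)
    return [s + i * step_ms for i in range(len(arrivals_ms))]


# Hypothetical numbers: steps arrive every 150 ms but execute in 100 ms.
arrivals = [100, 250, 400, 550]
step = 100

naive_starts, naive_stall = naive_schedule(arrivals, step)
smooth_starts = delayed_schedule(arrivals, step)

print("naive starts:  ", naive_starts, "stall:", naive_stall, "ms")
print("delayed starts:", smooth_starts)
print("finish times:  ", naive_starts[-1] + step, smooth_starts[-1] + step)
```

In this toy example both schedules finish at the same time, but the naive one pauses between steps while the delayed one runs without gaps, which mirrors the abstract's point that continuity can be gained without changing the total plan generation time.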