CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human
Nan Sun, Yongchang Li, Chenxu Wang, Bo Mao, Huiying Li, Jiahe Yao, Kanghao Li, Yifan Zhang, Jian Liu, Guoying Zhang, Di Guo, Huaping Liu
AI summary
Problem
Prior vision-language-action models suffer from domain overfitting, non-interpretable reasoning, and high latency from auxiliary generative models, while lacking mechanisms for real-time failure recognition and interactive correction.
Approach
The framework couples a vision-language model backbone with a diffusion-based action generator under a mixture-of-experts design, enabling the robot to explicitly reflect on its state and proactively ask humans for brief guidance when uncertain or failing.
Key results
- Cuts normalized execution time by ~2× and generative planning steps by ~4× compared to prior methods
- Achieves higher task success rates while maintaining low inference latency
- Enables explicit self-reflection and calibrated uncertainty detection to trigger just-in-time human queries
- Preserves strong multimodal understanding and grounding without degrading visuomotor performance
Why it matters
It provides a practical, unified framework for making robot policies transparent, robust, and collaboratively assistive, bridging the gap between autonomous control and human-in-the-loop interaction.
Abstract
In this work, we present CollabVLA, a self-reflective vision–language–action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized Time by ∼2× and Dream counts by ∼4× vs. generative agents, achieving higher success rates, improved interpretability, and balanced low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.
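The abstract describes soliciting human guidance "when confronted with uncertainty or repeated failure" but does not specify the trigger. The following is a minimal illustrative sketch of one way such a gate could look, not the authors' implementation. The `Reflection` fields, the function names, and both thresholds (`uncertainty_threshold`, `max_failures`) are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reflection:
    """Hypothetical output of a reflective reasoning step (not from the paper)."""
    text: str           # natural-language self-assessment
    uncertainty: float  # assumed to lie in [0, 1]
    failed: bool        # whether the last attempt is judged a failure


def should_ask_human(reflection: Reflection,
                     consecutive_failures: int,
                     uncertainty_threshold: float = 0.7,  # assumed value
                     max_failures: int = 2) -> bool:      # assumed value
    """Return True when the policy should pause and request brief guidance."""
    if reflection.uncertainty >= uncertainty_threshold:
        return True
    return consecutive_failures >= max_failures


def run_episode(reflections: List[Reflection]) -> List[str]:
    """Map a sequence of reflections to 'act' or 'ask' decisions."""
    failures = 0
    decisions = []
    for reflection in reflections:
        # Track the current streak of failed attempts.
        failures = failures + 1 if reflection.failed else 0
        if should_ask_human(reflection, failures):
            decisions.append("ask")
            failures = 0  # assume human guidance resets the failure streak
        else:
            decisions.append("act")
    return decisions


demo = [
    Reflection("grasp looks stable", 0.1, False),
    Reflection("object slipped", 0.3, True),
    Reflection("object slipped again", 0.3, True),
    Reflection("target is ambiguous", 0.9, False),
    Reflection("placement succeeded", 0.2, False),
]
decisions = run_episode(demo)
```

In this toy trace the gate fires twice: once after two consecutive failures and once on a single high-uncertainty step. In a real system the uncertainty signal would come from the model itself, and how it is calibrated is outside the scope of this sketch.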