ICRA 2026

CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human

Nan Sun, Yongchang Li, Chenxu Wang, Bo Mao, Huiying Li, Jiahe Yao, Kanghao Li, Yifan Zhang, Jian Liu, Guoying Zhang, Di Guo, Huaping Liu

AI summary

CollabVLA transforms standard robot policies into collaborative assistants by integrating self-reflective reasoning with diffusion-based action generation, cutting latency and boosting success rates through just-in-time human guidance.

Vision-Language-Action, Self-Reflection, Human-in-the-Loop, Diffusion Policies, Mixture-of-Experts, Collaborative Robotics

Problem

Prior vision-language-action models suffer from domain overfitting, non-interpretable reasoning, and high latency from auxiliary generative models, while lacking mechanisms for real-time failure recognition and interactive correction.

Approach

The framework couples a vision-language model backbone with a diffusion-based action generator under a mixture-of-experts design, enabling the robot to explicitly reflect on its state and proactively ask humans for brief guidance when uncertain or failing.
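The ask-when-uncertain behavior described above can be illustrated with a minimal control-loop sketch. Everything here is an assumption made for illustration: the `Reflection` fields, the callables `reflect`, `act`, and `ask_human`, and the fixed thresholds are hypothetical and are not taken from the paper. In CollabVLA the decision to ask is learned through reflection tuning; the hand-written threshold rule below is only a stand-in for it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Reflection:
    """Hypothetical output of a reflective reasoning step."""
    text: str           # natural-language self-assessment
    uncertainty: float  # assumed to lie in [0, 1]
    failed: bool        # whether the last attempt is judged a failure


def should_ask_human(reflection: Reflection,
                     consecutive_failures: int,
                     uncertainty_threshold: float = 0.7,
                     max_failures: int = 2) -> bool:
    """Illustrative rule: ask on high uncertainty or repeated failure."""
    return (reflection.uncertainty >= uncertainty_threshold
            or consecutive_failures >= max_failures)


def run_episode(reflect: Callable[[str, Optional[str], int], Reflection],
                act: Callable[[str, Optional[str]], str],
                ask_human: Callable[[str], str],
                instruction: str,
                max_steps: int = 10) -> List[str]:
    """Alternate reflection and action, querying the human only when needed.

    `reflect` stands in for the reflective reasoning step and `act` for the
    action generator; both are placeholders supplied by the caller.
    """
    log: List[str] = []
    guidance: Optional[str] = None
    consecutive_failures = 0

    for step in range(max_steps):
        reflection = reflect(instruction, guidance, step)
        consecutive_failures = consecutive_failures + 1 if reflection.failed else 0

        if should_ask_human(reflection, consecutive_failures):
            # Pass the self-assessment to the human as the question context.
            guidance = ask_human(reflection.text)
            consecutive_failures = 0
            log.append(f"ask:{guidance}")

        # Act with whatever guidance is currently available (possibly None).
        log.append(act(instruction, guidance))

    return log
```

Keeping the query decision in a separate function makes the trigger easy to swap: a learned, calibrated signal could replace the threshold rule without touching the loop.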

Key results

Cuts normalized execution time by ~2× and generative planning steps (Dream counts) by ~4× compared with generative agents
Achieves higher task success rates while maintaining low inference latency
Enables explicit self-reflection and calibrated uncertainty detection to trigger just-in-time human queries
Preserves strong multimodal understanding and grounding without degrading visuomotor performance

Why it matters

It provides a practical, unified framework for making robot policies transparent, robust, and collaboratively assistive, bridging the gap between autonomous control and human-in-the-loop interaction.

Abstract

In this work, we present CollabVLA, a self-reflective vision–language–action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized Time by ∼2× and Dream counts by ∼4× vs. generative agents, achieving higher success rates, improved interpretability, and balanced low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.

Index terms

Human-Robot Collaboration, AI-Based Methods, AI-Enabled Robotics