ICRA 2026

IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction

Yandu Chen, Kefan Gu, Yuqing Wen, Yucheng Zhao, Tiancai Wang, Liqiang Nie

AI summary

IntentionVLA enables robots to interpret implicit human intentions and execute actions in real time by combining curriculum-trained reasoning with compact, reasoning-guided, diffusion-based action generation.

Vision-Language-Action models, Embodied intention reasoning, Curriculum training, Diffusion-based action generation, Human-robot interaction, Real-time inference

Problem

Current Vision-Language-Action models lack reasoning-intensive pretraining and fail to interpret implicit human intentions, causing them to struggle with contextual understanding and accurate execution in complex, real-world interactions.

Approach

The method employs a two-stage curriculum training paradigm that first equips a VLM backbone with embodied intention and spatial reasoning capabilities, then distills these into compact reasoning cues to guide a diffusion-based action generator for fast inference.
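The data flow described above (reasoning first, then a compact cue that conditions action generation) can be sketched with a toy example. This is only an illustration under loud assumptions, not the authors' implementation: `ReasoningCue`, `infer_cue`, `generate_action`, the keyword rules, and the 2-D scene are all hypothetical. A keyword lookup stands in for the VLM backbone, and a deterministic iterative refinement from random noise stands in for a learned diffusion sampler.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class ReasoningCue:
    """Hypothetical compact cue: the inferred target object and its 2-D position."""
    target: str
    location: tuple[float, float]


def infer_cue(instruction: str, scene: dict[str, tuple[float, float]]) -> ReasoningCue:
    """Stand-in for the reasoning backbone (keyword rules, NOT a real VLM)."""
    text = instruction.lower()
    # Hypothetical mapping from an implicit intention to an object.
    intention_rules = {"thirsty": "cup", "cold": "blanket"}
    for keyword, obj in intention_rules.items():
        if keyword in text and obj in scene:
            return ReasoningCue(obj, scene[obj])
    # Fall back to explicit instructions that name the object directly.
    for obj, position in scene.items():
        if obj in text:
            return ReasoningCue(obj, position)
    raise ValueError("no object in the scene matches the instruction")


def generate_action(cue: ReasoningCue, steps: int = 20, seed: int = 0) -> tuple[float, ...]:
    """Toy action generator: refine a noisy sample toward the cue's location.

    A real diffusion policy would use a learned denoiser; here each step simply
    closes an equal share of the remaining gap to the target.
    """
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in cue.location]  # start from noise
    for k in range(steps):
        remaining = steps - k
        action = [a + (t - a) / remaining for a, t in zip(action, cue.location)]
    return tuple(action)


if __name__ == "__main__":
    scene = {"cup": (0.4, 0.2), "blanket": (-0.3, 0.5)}
    cue = infer_cue("I'm thirsty", scene)   # implicit intention, no object named
    print(cue.target, generate_action(cue))
```

The point of the sketch is the interface: the action generator sees only the compact cue, never a long reasoning trace, which is the property the summary credits for fast inference.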

Key results

18% higher success rate than π0 with direct instructions, and 28% higher than ECoT under intention instructions
Over twice the success rate of all baselines on out-of-distribution intention tasks
Enables zero-shot human-robot interaction with a 40% success rate
Automated pipeline for generating intention, spatial, and compact reasoning data

Why it matters

Provides a scalable, real-time framework for next-generation human-robot interaction systems that require accurate interpretation of ambiguous, intention-driven commands.

Abstract

Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms π0, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.

Index terms

Deep Learning in Grasping and Manipulation, Deep Learning Methods, AI-Based Methods