LACY: A Vision-Language Model-Based Language-Action Cycle for Self-Improving Robotic Manipulation
Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi
AI summary
Problem
Current vision-language-action models rely on unidirectional language-to-action mapping and massive passive datasets, which limits their generalization, data efficiency, and ability to explain or verify their own behavior.
Approach
LACY fine-tunes a single vision-language model to jointly generate actions from language, explain actions in language, and verify semantic consistency, creating a closed-loop cycle that autonomously generates and filters high-quality training data.
Key results
- Unified VLM framework jointly trained for language-to-action, action-to-language, and consistency verification tasks
- Self-improving data generation pipeline that autonomously creates and filters high-quality training samples via an L2A2L cycle
- Confidence-based active data augmentation strategy that targets low-confidence scenarios to mitigate overfitting
- Average increase of 50.56 percentage points in task success rate over baseline methods on pick-and-place tasks in both simulation and the real world
Why it matters
It offers a scalable, data-efficient approach for robotic manipulation that reduces dependence on costly human demonstrations while improving policy robustness and interpretability.
Abstract
Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that excel at mapping language instructions to actions (L2A). However, this unidirectional training paradigm often produces policies that can execute tasks without deeper contextual understanding, thereby limiting their ability to generalize and to explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic and robust grounding. An agent capable of both acting and explaining its actions can form richer internal representations and, critically, unlock new paradigms for self-supervised learning. In this paper, we introduce LACY (Language-Action CYcle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). The framework enables a self-improving cycle that autonomously generates new training data by chaining the L2A and A2L modules in an L2A2L pipeline. The L2C module then filters this data using an active data augmentation strategy that selectively targets low-confidence cases, thereby improving the model efficiently without requiring additional human annotations. Extensive experiments on pick-and-place tasks in both simulation and the real world demonstrate that LACY substantially improves task success rates by 50.56 percentage points on average compared to baseline methods and yields more robust language-action grounding for robotic manipulation. For more details, please refer to our project page: https://vla2026.github.io/LACY/
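The L2A2L cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Sample`, `l2a2l_cycle`, the `toy_*` stand-ins, the four-number action encoding, and the 0.8 confidence threshold are all hypothetical, and in LACY the three roles are played by a single fine-tuned vision-language model that also sees the scene image. The sketch only shows the control flow: generate an action (L2A), explain it (A2L), and keep the sample if the explanation is consistent with the original instruction (L2C), restricting attention to low-confidence cases.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical action encoding: pick (x, y) followed by place (x, y).
Action = Tuple[float, float, float, float]


@dataclass
class Sample:
    instruction: str
    action: Action
    explanation: str


def l2a2l_cycle(
    instructions: List[str],
    l2a: Callable[[str], Tuple[Action, float]],  # instruction -> (action, confidence)
    a2l: Callable[[Action], str],                # action -> language explanation
    l2c: Callable[[str, str], bool],             # are two descriptions consistent?
    confidence_threshold: float = 0.8,           # assumed value, for illustration only
) -> List[Sample]:
    """Collect self-generated samples that pass the consistency check."""
    accepted: List[Sample] = []
    for instruction in instructions:
        action, confidence = l2a(instruction)
        # Assumption: only low-confidence cases are targeted for augmentation.
        if confidence >= confidence_threshold:
            continue
        explanation = a2l(action)
        # Keep the sample only if the round trip preserves the meaning.
        if l2c(instruction, explanation):
            accepted.append(Sample(instruction, action, explanation))
    return accepted


# Toy stand-ins so the sketch runs without a model.
def toy_l2a(instruction: str) -> Tuple[Action, float]:
    if "red" in instruction:
        return (0.1, 0.2, 0.3, 0.4), 0.5
    if "blue" in instruction:
        return (0.5, 0.5, 0.6, 0.6), 0.4
    return (0.0, 0.0, 0.0, 0.0), 0.95


def toy_a2l(action: Action) -> str:
    if action == (0.1, 0.2, 0.3, 0.4):
        return "Pick the red block and place it left of the bowl"
    return "pick the green block and place it on the plate"


def toy_l2c(a: str, b: str) -> bool:
    # Crude proxy for semantic consistency: case-insensitive exact match.
    return a.strip().lower() == b.strip().lower()


instructions = [
    "pick the red block and place it left of the bowl",    # low confidence, consistent
    "pick the blue block and place it right of the bowl",  # low confidence, inconsistent
    "pick the yellow block and place it in the bowl",      # high confidence, skipped
]
samples = l2a2l_cycle(instructions, toy_l2a, toy_a2l, toy_l2c)
for s in samples:
    print(s.instruction, "->", s.action)
```

With these toy stand-ins, only the first instruction yields an accepted sample: the second fails the consistency check and the third is skipped because the model is already confident. In a real system the accepted samples would be added to the fine-tuning set for the next training round.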