ICRA 2026

LACY: A Vision-Language Model-Based Language-Action Cycle for Self-Improving Robotic Manipulation

Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi

AI summary

LACY enables robotic policies to self-improve by mapping bidirectionally between language and actions, raising task success rates by more than 50 percentage points on average without additional human annotations.

Vision-Language Models; Robotic Manipulation; Self-Supervised Learning; Language-Action Cycle; Active Data Augmentation; Policy Generalization

Problem

Current vision-language-action models rely on unidirectional language-to-action mapping and massive passive datasets, which limits their generalization, data efficiency, and ability to explain or verify their own behavior.

Approach

LACY fine-tunes a single vision-language model to jointly generate actions from language, explain actions in language, and verify semantic consistency, creating a closed-loop cycle that autonomously generates and filters high-quality training data.
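The closed loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three callables stand in for the L2A, A2L, and L2C tasks of the single fine-tuned model, and the toy scene, the 2D action parameterization, and the string-matching consistency check are all assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical toy scene: object name -> 2D position on the table.
SCENE = {"red block": (0.1, 0.2), "blue bowl": (0.5, 0.5), "green block": (0.8, 0.3)}
INVERSE = {xy: name for name, xy in SCENE.items()}


@dataclass(frozen=True)
class Action:
    """Parameterized pick-and-place action (assumed 2D parameterization)."""
    pick_xy: Tuple[float, float]
    place_xy: Tuple[float, float]


@dataclass(frozen=True)
class Sample:
    instruction: str
    action: Action
    description: str


def mentioned(text: str) -> List[str]:
    """Scene objects named in the text, in order of appearance."""
    found = [(text.find(name), name) for name in SCENE if name in text]
    return [name for _, name in sorted(found)]


def toy_l2a(instruction: str) -> Action:
    """Stand-in for L2A: language -> action. Guesses when grounding fails."""
    names = mentioned(instruction)
    if len(names) < 2:
        names = list(SCENE)[:2]  # simulated wrong guess
    return Action(SCENE[names[0]], SCENE[names[1]])


def toy_a2l(action: Action) -> str:
    """Stand-in for A2L: action -> language description."""
    return f"pick the {INVERSE[action.pick_xy]} and place it at the {INVERSE[action.place_xy]}"


def toy_l2c(instruction: str, description: str) -> bool:
    """Stand-in for L2C: do the two descriptions refer to the same task?"""
    return mentioned(instruction) == mentioned(description)


def l2a2l_cycle(
    instructions: List[str],
    l2a: Callable[[str], Action],
    a2l: Callable[[Action], str],
    l2c: Callable[[str, str], bool],
) -> List[Sample]:
    """Chain L2A and A2L, then keep only samples that pass the L2C check."""
    kept = []
    for instruction in instructions:
        action = l2a(instruction)
        description = a2l(action)
        if l2c(instruction, description):
            kept.append(Sample(instruction, action, description))
    return kept


kept = l2a2l_cycle(
    ["put the red block in the blue bowl", "move the green block next to the yellow cup"],
    toy_l2a, toy_a2l, toy_l2c,
)
for sample in kept:
    print(sample.instruction, "->", sample.description)
```

In this toy run the second instruction names an object that is not in the scene, so the stand-in L2A guesses, the generated description no longer matches the instruction, and the consistency check discards the sample.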

Key results

Unified VLM framework jointly trained for language-to-action, action-to-language, and consistency verification tasks
Self-improving data generation pipeline that autonomously creates and filters high-quality training samples via an L2A2L cycle
Confidence-based active data augmentation strategy that targets low-confidence scenarios to mitigate overfitting
50.56-percentage-point average increase in task success rate over baselines in simulated and real-world pick-and-place tasks
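One way to picture the confidence-based targeting listed above is the sketch below. The summary does not say how LACY measures confidence, so this example assumes a simple proxy: the fraction of repeated L2A2L cycles that pass the consistency check for a given case. The function names, case identifiers, and the 0.6 threshold are hypothetical.

```python
from typing import Dict, List


def estimate_confidence(outcomes: List[bool]) -> float:
    """Assumed proxy: share of repeated cycles that passed verification."""
    return sum(outcomes) / len(outcomes)


def select_low_confidence(confidences: Dict[str, float], threshold: float) -> List[str]:
    """Return cases below the threshold, least confident first."""
    low = [(conf, case) for case, conf in confidences.items() if conf < threshold]
    return [case for _, case in sorted(low)]


# Hypothetical verification outcomes for three cases, four cycles each.
outcomes = {
    "scene_a": [True, True, True, True],
    "scene_b": [True, False, False, False],
    "scene_c": [True, False, True, False],
}
confidences = {case: estimate_confidence(runs) for case, runs in outcomes.items()}
targets = select_low_confidence(confidences, threshold=0.6)
print(targets)
```

Under these assumptions, new training data would be generated only for the cases in `targets`, so augmentation effort goes to the cases the model handles least reliably rather than to ones it already solves.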

Why it matters

It offers a scalable, data-efficient approach for robotic manipulation that reduces dependence on costly human demonstrations while improving policy robustness and interpretability.

Abstract

Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that excel at mapping language instructions to actions (L2A). However, this unidirectional training paradigm often produces policies that can execute tasks without deeper contextual understanding, thereby limiting their ability to generalize and to explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic and robust grounding. An agent capable of both acting and explaining its actions can form richer internal representations and, critically, unlock new paradigms for self-supervised learning. In this paper, we introduce LACY (Language-Action CYcle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). The framework enables a self-improving cycle that autonomously generates new training data by chaining the L2A and A2L modules in an L2A2L pipeline. The L2C module then filters this data using an active data augmentation strategy that selectively targets low-confidence cases, thereby improving the model efficiently without requiring additional human annotations. Extensive experiments on pick-and-place tasks in both simulation and the real world demonstrate that LACY substantially improves task success rates by 50.56 percentage points on average compared to baseline methods and yields more robust language-action grounding for robotic manipulation. For more details, please refer to our project page: https://vla2026.github.io/LACY/

Index terms

Deep Learning in Grasping and Manipulation; Grasping