ICRA 2026

Open-World Object Manipulation with Vision-Language-Action Models Via Synthetic Multi-Modal Data

Yefei Chen, Junjie Wen, Jinming Li, Zhongyi Zhou, Yaxin Peng, Chaomin Shen, Yi Xu, Yichen Zhu

AI summary

Co-finetuning VLA models with automatically synthesized image-text data and localization metadata enables zero-shot generalization to 100 novel objects without task-specific retraining.

Vision-Language-Action models, object generalization, synthetic data, zero-shot manipulation, robotic learning, localization reasoning

Problem

VLA models struggle to generalize learned manipulation skills to novel, unseen objects because heavy reliance on teleoperated robot data causes catastrophic forgetting of pre-trained vision-language knowledge, limiting scalability in dynamic environments.

Approach

The Search2Scene pipeline automatically searches for object images, generates multi-view 3D representations, composes them into contextual scenes, and pairs them with bounding box annotations to co-finetune a VLA model alongside robot interaction data enriched with localization reasoning.
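The scene-composition and annotation step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the image search and multi-view 3D generation stages are omitted, images are stood in for by small grids of labels, and every name here (`SceneSample`, `compose_scene`, the caption wording) is a hypothetical chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneSample:
    image: list   # H x W grid of pixel labels (a stand-in for an RGB image)
    text: str     # caption paired with the image (wording is an assumption)
    bbox: tuple   # (x_min, y_min, x_max, y_max), inclusive pixel coordinates


def compose_scene(background, cutout, name, rng):
    """Paste an object cutout into a background and record its bounding box."""
    bg_h, bg_w = len(background), len(background[0])
    ob_h, ob_w = len(cutout), len(cutout[0])
    if ob_h > bg_h or ob_w > bg_w:
        raise ValueError("cutout must fit inside the background")

    # Pick a random placement that keeps the whole cutout in frame.
    x0 = rng.randint(0, bg_w - ob_w)
    y0 = rng.randint(0, bg_h - ob_h)

    image = [row[:] for row in background]  # copy so the input is untouched
    xs, ys = [], []
    for dy in range(ob_h):
        for dx in range(ob_w):
            if cutout[dy][dx] is not None:  # None marks a transparent pixel
                image[y0 + dy][x0 + dx] = cutout[dy][dx]
                xs.append(x0 + dx)
                ys.append(y0 + dy)
    if not xs:
        raise ValueError("cutout has no opaque pixels")

    # Tight box around the opaque pixels only.
    bbox = (min(xs), min(ys), max(xs), max(ys))
    text = f"The {name} is located at {bbox}."
    return SceneSample(image, text, bbox)


rng = random.Random(0)
background = [["table"] * 8 for _ in range(6)]
cutout = [[None, "peach"], ["peach", "peach"]]
sample = compose_scene(background, cutout, "peach", rng)
```

Each `SceneSample` pairs an image with text and a box, which is the kind of image-text data with localization metadata that the summary says is mixed with robot interaction data during co-finetuning.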

Key results

100% success rate on in-distribution objects for the 'move to' task
64% success rate on 100 out-of-distribution objects, surpassing the π0 baseline by 20 percentage points
Successful skill transfer to pushing and rotating unseen objects
Bounding box grounding and localization reasoning enable zero-shot generalization
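A short calculation clarifies what a 20 percentage point margin means for the out-of-distribution result. The baseline rate below is only implied by the two figures stated above, under the assumption that both methods were scored on the same 100 objects; the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(ours_pct, gap_pp):
    """Return the implied baseline rate and the relative gain over it."""
    baseline_pct = ours_pct - gap_pp      # percentage points subtract directly
    relative_gain = gap_pp / baseline_pct  # gain as a fraction of the baseline
    return baseline_pct, relative_gain


baseline_pct, relative_gain = implied_baseline(64.0, 20.0)
print(f"implied baseline: {baseline_pct:.0f}%")
print(f"relative gain over baseline: {relative_gain:.1%}")
```

Under that assumption, the absolute gap of 20 points corresponds to a considerably larger relative improvement, which is why the two measures should not be used interchangeably.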

Why it matters

Enables scalable, zero-shot object generalization for robotic manipulation, reducing reliance on extensive human demonstrations and paving the way for more flexible real-world robotic systems.

Abstract

Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of robot data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization, where a robot trained to perform a task with one object, such as “hand over the apple,” struggles to transfer its skills to a semantically similar but visually different object, such as “hand over the peach.” This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. We design a lightweight image-text data synthesis pipeline, Search2Scene, which enables robots to generalize learned skills to novel objects without requiring explicit human demonstrations for each new target object. By leveraging vision-language pair data, our method provides a lightweight and scalable way to inject knowledge about the target object, establishing an implicit link between the object and the desired action. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate in selecting objects not seen during training. These results highlight the effectiveness of our approach in enabling object-level generalization and reducing the need for extensive human demonstrations, paving the way for more flexible and scalable robotic learning systems.

Affiliations: 1 School of Computer Science, East China Normal University; 2 Department of Mathematics, School of Science, Shanghai University; 3 University of Toronto; 4 Midea Group. † Corresponding authors. This work was done while Yefei Chen, Junjie Wen, Jinming Li, Zhongyi Zhou and Yichen Zhu were at Midea Group.

Index terms

Imitation Learning, Data Sets for Robot Learning