ICRA 2026

Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation

Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Yiwen Hou, Boren Zheng, Chenrui Tie, Jiajun Deng, Lin Shao

AI summary

Goal-VLA enables zero-shot robot manipulation by using image-generative VLMs as object-centric world models to synthesize and refine goal states, bypassing the need for paired action data.

Zero-shot manipulation, Vision-Language-Action models, Object-centric world models, Image generation, Spatial grounding, Robotic generalization

Problem

Current Vision-Language-Action models struggle with zero-shot generalization because they either depend on costly paired action data or lack precise spatial reasoning for complex manipulation tasks.

Approach

Goal-VLA decouples high-level semantic planning from low-level control by using an image-generative VLM to create and iteratively refine a goal image, which is then translated into precise 3D object transformations for training-free execution.
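The step from a goal image to a 3D object transformation can be illustrated with a small geometric sketch. It is not the paper's pipeline: it assumes that three non-collinear 3D points on the object have already been matched between the current observation and the goal state (for example via depth and keypoint matching, which is not shown), and it recovers the rigid transform by aligning orthonormal frames built from those points. All function names are hypothetical.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("points are coincident or collinear")
    return [c / n for c in v]

def frame_from_points(p0, p1, p2):
    """Right-handed orthonormal axes (x, y, z) built from three non-collinear points."""
    x = _normalize(_sub(p1, p0))
    z = _normalize(_cross(x, _sub(p2, p0)))
    y = _cross(z, x)
    return [x, y, z]

def rigid_transform(current, goal):
    """Return (R, t) such that goal_i = R @ current_i + t for three matched points.

    Assumption: correspondences are exact (noise-free). With noisy or many
    points, a least-squares fit would be the usual choice instead.
    """
    fc = frame_from_points(*current)
    fg = frame_from_points(*goal)
    # R maps each current axis onto the matching goal axis: R = Fg @ Fc^T.
    R = [[sum(fg[k][i] * fc[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    p0, q0 = current[0], goal[0]
    t = [q0[i] - sum(R[i][j] * p0[j] for j in range(3)) for i in range(3)]
    return R, t

def apply_transform(R, t, p):
    """Apply the rigid transform to a 3D point."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

# Usage: points on the object now, and where they appear in the goal state.
current_pts = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
goal_pts = [[0.2, 0.0, 0.1], [0.2, 1.0, 0.1], [-0.8, 0.0, 0.1]]
R, t = rigid_transform(current_pts, goal_pts)
```

The resulting (R, t) is an object-centric target: a downstream controller can move any point rigidly attached to the object, such as a grasp point, with `apply_transform`, without needing action labels.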

Key results

Introduces a decoupled hierarchical framework leveraging image-generative VLMs as object-centric world models
Proposes Reflection-through-Synthesis for iterative goal image validation and refinement
Achieves strong zero-shot generalization across diverse simulated and real-world manipulation tasks without fine-tuning
Outperforms baseline methods such as MOKA, VoxPoser, and MolmoAct in success rate
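The Reflection-through-Synthesis idea listed above, validating a generated goal image and refining it before execution, can be sketched as a simple generate-and-check loop. The summary does not specify the interface, so everything here is an assumption: the `generate` and `validate` callables stand in for the image-generative VLM and the checking step, the textual feedback passed between them is hypothetical, and `max_rounds` is an illustrative cap.

```python
from typing import Any, Callable, Optional, Tuple

def reflect_through_synthesis(
    generate: Callable[[str, Optional[str]], Any],
    validate: Callable[[str, Any], Tuple[bool, str]],
    instruction: str,
    max_rounds: int = 3,
) -> Optional[Any]:
    """Return the first goal image that passes validation, or None.

    generate(instruction, feedback) -> candidate goal image (hypothetical)
    validate(instruction, image)    -> (accepted, feedback)  (hypothetical)
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(instruction, feedback)
        accepted, feedback = validate(instruction, candidate)
        if accepted:
            return candidate
    # No candidate passed within the budget; the caller decides how to recover.
    return None
```

Returning `None` rather than the last rejected candidate keeps an unvalidated goal from reaching the low-level controller; whether the actual system behaves this way is not stated in the source.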

Why it matters

Enables robust robotic manipulation across diverse environments and embodiments without paired action data, advancing the practical deployment of autonomous robots in unstructured settings.

Abstract

Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world semantic knowledge. However, their zero-shot capability lags significantly behind the base VLMs, as the instruction-vision-action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present Goal-VLA, a zero-shot framework that leverages Image-Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that object state representation is the golden interface, naturally separating a manipulation system into high-level and low-level policies. This representation abstracts away explicit action annotations, allowing the use of highly generalizable VLMs while simultaneously providing spatial cues for training-free low-level control. To further improve robustness, we introduce a Reflection-through-Synthesis process that iteratively validates and refines the generated goal image before execution. Both simulated and real-world experiments demonstrate that our Goal-VLA achieves strong performance and inspiring generalizability in manipulation tasks. Supplementary materials are available at https://nus-lins-lab.github.io/goalvlaweb/.

* denotes equal contribution; † denotes the corresponding author.

Affiliations: 1 School of Computing, National University of Singapore; 2 The HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong; 3 Yuanpei College, Peking University; 4 Department of Automation, Tsinghua University.

Index terms

Manipulation Planning, AI-Enabled Robotics, Failure Detection and Recovery