ICRA 2026

DISCO: Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

Ce Hao, Kelvin Lin, Zhiwei Xue, Siyuan Luo, Harold Soh

AI summary

DISCO enables robust zero-shot robot manipulation from novel language instructions by combining off-the-shelf vision-language keyframes with constrained diffusion inpainting.

Language-conditioned manipulation, Diffusion policies, Vision-language models, Constrained inpainting, Zero-shot transfer, Open-vocabulary robotics

Problem

Fine-tuned language-conditioned diffusion policies struggle to generalize to unseen, open-vocabulary instructions due to data scarcity, while direct VLM trajectory generation often produces unreliable or infeasible motions.

Approach

The framework uses off-the-shelf vision-language models to extract coarse 3D keyframes from language prompts, which then guide a diffusion policy via constrained inpainting that balances keyframe adherence with learned motion priors.
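To make the idea of keyframe inpainting concrete, here is a minimal sketch and not DISCO's actual implementation. It assumes a toy stand-in for the learned policy (`toy_denoiser`, which simply pulls waypoints toward a straight line) and a hypothetical blend parameter `weight`: `weight=1.0` overwrites the keyframe waypoints at every denoising step (hard inpainting), while smaller values let the motion prior partially override a keyframe. The function names, the blending rule, and all parameters are illustrative assumptions; the optimization used in the paper is not described on this page.

```python
import numpy as np

def toy_denoiser(traj, t):
    """Hypothetical stand-in for a learned diffusion policy (t is unused here).

    Moves every waypoint halfway toward the straight line joining the
    current first and last waypoints, acting as a crude motion prior.
    """
    prior = np.linspace(traj[0], traj[-1], len(traj))
    return traj + 0.5 * (prior - traj)

def inpaint_sample(keyframes, horizon=16, steps=20, weight=1.0, seed=0):
    """Sample a 3D trajectory while steering it through keyframes.

    keyframes: dict mapping waypoint index -> 3D target position.
    weight:    1.0 = hard inpainting (keyframes overwritten every step),
               < 1.0 = soft blend between the prior's output and the keyframe.
    """
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=(horizon, 3))          # start from pure noise
    idx = np.array(sorted(keyframes))
    targets = np.array([keyframes[i] for i in idx], dtype=float)
    for t in range(steps, 0, -1):
        traj = toy_denoiser(traj, t)              # one "denoising" step
        # Inpainting: pull the keyframe waypoints toward their targets.
        traj[idx] = (1.0 - weight) * traj[idx] + weight * targets
    return traj

# Keyframe 8 is deliberately far from the straight-line prior.
keyframes = {0: [0, 0, 0], 8: [5, 5, 5], 15: [1, 1, 1]}
hard = inpaint_sample(keyframes, weight=1.0)
soft = inpaint_sample(keyframes, weight=0.3)
print("hard waypoint 8:", hard[8])
print("soft waypoint 8:", soft[8])
```

In this toy setting, hard inpainting reproduces the outlying keyframe exactly, whereas the soft variant settles somewhere between the keyframe and the straight-line prior. That is the kind of trade-off between keyframe adherence and learned motion priors the approach describes, shown here under much simpler assumptions.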

Key results

VLM-generated keyframes reliably guide diffusion-based action generation
Constrained inpainting optimization mitigates failures from inaccurate or out-of-distribution keyframes
Superior generalization on unseen and open-vocabulary tasks compared to fine-tuned baselines in simulation
Successful zero-shot transfer to real-world language-guided grasping without fine-tuning

Why it matters

DISCO offers a scalable, data-efficient pathway for robots to execute novel language commands in real-world environments without task-specific training.

Abstract

Diffusion policies have demonstrated strong performance in generative modeling, making them promising for robotic manipulation guided by natural language instructions. However, generalizing language-conditioned diffusion policies to open-vocabulary instructions in everyday scenarios remains challenging due to the scarcity and cost of robot demonstration datasets. To address this, we propose DISCO, a framework that leverages off-the-shelf vision-language models (VLMs) to bridge natural language understanding with high-performance diffusion policies. DISCO translates linguistic task descriptions into actionable 3D keyframes using VLMs, which then guide the diffusion process through constrained inpainting. However, enforcing strict adherence to these keyframes can degrade performance when the VLM-generated keyframes are inaccurate. To mitigate this, we introduce an inpainting optimization strategy that balances keyframe adherence with learned motion priors from training data. Experimental results in both simulated and real-world settings demonstrate that DISCO outperforms conventional fine-tuned language-conditioned policies, achieving superior generalization in zero-shot, open-vocabulary manipulation tasks. Videos are available at sites.google.com/view/disco2025.

Index terms

Imitation Learning; Machine Learning for Robot Control