ICRA 2026

VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation

Jiaming Chen, Yiyu Jiang, Aoshen Huang, Yang Li, Wei Pan

AI summary

The VLM-SFD framework enables dual-arm robots to learn and execute complex cooperative tasks from just ten demonstrations, requiring no real-world fine-tuning.

Dual-arm manipulation, Diffusion models, Vision-language models, Imitation learning, Motion planning, Siamese networks

Problem

Current dual-arm manipulation methods struggle to generalize across diverse tasks and unstructured environments, typically requiring extensive demonstration data and real-world fine-tuning to achieve reliable coordination.

Approach

The framework employs a Siamese diffusion network to generate synchronized, language-conditioned motion flows for two objects, then uses a vision-language model to dynamically assign and schedule these flows to each arm for collision-free execution.
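The "Siamese" part of the approach can be illustrated with a toy sketch: both objects pass through one encoder with the same weights, so their embeddings land in a shared latent space. This is only an illustration under stated assumptions, not SFDNet itself. The function names (`make_encoder`, `encode`, `siamese_encode`), the single linear layer with a tanh, and the feature sizes are all hypothetical, and the diffusion step is omitted entirely.

```python
import math
import random


def make_encoder(in_dim, latent_dim, seed=0):
    """Create weights for one linear encoder (hypothetical sizes, random init)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(latent_dim)]


def encode(weights, features):
    """Project a feature vector into the latent space: z = tanh(W x)."""
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in weights]


def siamese_encode(weights, object_a, object_b):
    """Embed both objects with the SAME weights so the latents are comparable."""
    return encode(weights, object_a), encode(weights, object_b)


# Usage: two made-up object feature vectors share one encoder.
shared = make_encoder(in_dim=4, latent_dim=3)
z_a, z_b = siamese_encode(shared, [0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1])
```

Because the weights are shared, identical inputs always map to identical latents, which is the property that makes the two embeddings directly comparable.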

Key results

First diffusion-based framework for dual-arm motion synthesis from limited demonstrations
VLM-assisted spatial-temporal task allocation for dynamic trajectory assignment
Competitive success rates on four real-world tasks using only ten demonstrations each
Direct real-world deployment without additional fine-tuning or real-world data collection

Why it matters

It drastically reduces data and tuning requirements for dual-arm robotics, enabling rapid and reliable deployment of complex bimanual skills in practical, unstructured environments.

Abstract

Dual-arm cooperative manipulation holds great promise for tackling complex real-world tasks that demand seamless coordination and adaptive dynamics. Despite substantial progress in learning-based motion planning, most approaches struggle to generalize across diverse manipulation tasks and adapt to dynamic, unstructured environments, particularly in scenarios involving interactions between two objects such as assembly, tool use, and bimanual grasping. To address these challenges, we introduce a novel VLM-Assisted Siamese Flow Diffusion (VLM-SFD) framework for efficient imitation learning in dual-arm cooperative manipulation. The proposed VLM-SFD framework exhibits outstanding adaptability, significantly enhancing the ability to rapidly adapt and generalize to diverse real-world tasks from only a minimal number of human demonstrations. Specifically, we propose a Siamese Flow Diffusion Network (SFDNet) that employs a dual-encoder-decoder Siamese architecture to embed two target objects into a shared latent space, while a diffusion-based conditioning process, conditioned on task instructions, generates two-stream object-centric motion flows that guide dual-arm coordination. We further design a dynamic task assignment strategy that seamlessly maps the predicted 2D motion flows into 3D space and incorporates a pre-trained vision-language model (VLM) to adaptively assign the optimal motion to each robotic arm over time. Experiments validate the effectiveness of the proposed method, demonstrating its ability to generalize to diverse manipulation tasks while maintaining high efficiency and adaptability.
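The abstract does not say how the predicted 2D motion flows are mapped into 3D space. One common way to lift image-plane waypoints to 3D, shown here purely as an assumption, is pinhole back-projection with a calibrated camera and a per-waypoint depth value. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` below are illustrative and not taken from the paper.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_flow(flow_2d, depths, fx, fy, cx, cy):
    """Lift a 2D flow (list of (u, v) waypoints) to 3D, one depth per waypoint."""
    if len(flow_2d) != len(depths):
        raise ValueError("need one depth value per waypoint")
    return [backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(flow_2d, depths)]


# Usage with made-up intrinsics and depths (assumed values, in pixels and metres).
flow_3d = lift_flow([(320, 240), (920, 240)], [1.0, 2.0],
                    fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

A pixel at the principal point maps onto the optical axis, so the first waypoint above lies straight ahead of the camera at its measured depth.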

Index terms

Dual Arm Manipulation, Visual Learning, Manipulation Planning