ICRA 2026

VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation

Xuanran Zhai, Qianyou Zhao, Qiaojun Yu, Ce Hao

AI summary

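VFP counters the mode-averaging behavior of standard flow-matching policies by combining a variational latent prior, Kantorovich Optimal Transport, and a Mixture-of-Experts decoder, achieving a 49% relative improvement in simulated task success rate over flow-based baselines while preserving fast inference.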

Flow matching, Multi-modal imitation learning, Robot manipulation, Variational inference, Optimal transport, Mixture-of-experts

Problem

Flow-matching policies accelerate robot action sampling but struggle with multi-modal distributions, often collapsing to averaged or ambiguous behaviors that fail in complex manipulation tasks.
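The averaging effect can be reproduced in a toy 1-D setting. The sketch below is an illustration under stated assumptions, not the paper's setup: expert actions are assumed to be exactly -1 or +1, the velocity field is assumed linear in the noise sample, and inference is a single Euler step from t=0 to t=1. All function names are hypothetical. Because the regression target x1 - x0 has conditional mean E[x1] - x0, the least-squares velocity sends every noise sample to roughly the mean of the two modes (near 0), an action that no expert ever took.

```python
import random
import statistics


def fit_linear_velocity(x0s, targets):
    """Least-squares fit of v(x0) = a * x0 + b, the MSE-optimal linear velocity at t=0."""
    mean_x = statistics.fmean(x0s)
    mean_y = statistics.fmean(targets)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(x0s, targets))
    var = sum((x - mean_x) ** 2 for x in x0s)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def one_step_samples(n=20000, seed=0):
    """Train on bimodal targets, then generate actions with one Euler step."""
    rng = random.Random(seed)
    x0s = [rng.gauss(0.0, 1.0) for _ in range(n)]       # noise samples at t=0
    x1s = [rng.choice((-1.0, 1.0)) for _ in range(n)]   # assumed bimodal expert actions
    targets = [x1 - x0 for x0, x1 in zip(x0s, x1s)]     # straight-line velocity targets
    a, b = fit_linear_velocity(x0s, targets)
    # Single Euler step from t=0 to t=1: x1_hat = x0 + v(x0)
    return [x0 + (a * x0 + b) for x0 in x0s]


samples = one_step_samples()
print("mean of generated actions:", statistics.fmean(samples))
print("largest |action|:", max(abs(s) for s in samples))
```

Both printed values stay close to 0, far from either mode at -1 or +1, which is the collapse described above.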

Approach

VFP uses a variational latent prior to identify distinct action modes, applies Kantorovich Optimal Transport to align predicted and expert distributions, and leverages a Mixture-of-Experts decoder for specialized, efficient sampling.
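To make the role of the latent concrete, here is a minimal sketch, assuming a toy 1-D problem with expert actions at -1 and +1, a discrete latent z with a fixed uniform prior, and one hand-written expert velocity field per mode. These choices, and every name in the code, are illustrative assumptions. The summary does not say whether VFP's latent is discrete or continuous or how its experts are gated, and the real model learns both the prior and the experts from observations.

```python
import random

# Assumed mode centers for the toy problem (not taken from the paper).
MODE_CENTERS = [-1.0, 1.0]
# Assumed uniform latent prior over modes; VFP would learn its prior.
PRIOR = [0.5, 0.5]


def make_expert(mode_center):
    """Hypothetical expert: the straight-line velocity toward a single mode."""
    return lambda x0: mode_center - x0


EXPERTS = [make_expert(c) for c in MODE_CENTERS]


def sample_action(rng):
    """Sample a latent mode, then take one Euler step with that mode's expert."""
    z = rng.choices(range(len(EXPERTS)), weights=PRIOR, k=1)[0]
    x0 = rng.gauss(0.0, 1.0)        # noise sample at t=0
    action = x0 + EXPERTS[z](x0)    # single Euler step from t=0 to t=1
    return z, action


rng = random.Random(0)
draws = [sample_action(rng) for _ in range(1000)]
modes_seen = {z for z, _ in draws}
actions = [a for _, a in draws]
print("modes sampled:", sorted(modes_seen))
print("distinct rounded actions:", sorted({round(a, 6) for a in actions}))
```

Conditioning on z means each expert only ever regresses toward one mode, so a single integration step lands on a valid action rather than on the average of the modes.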

Key results

49% relative improvement in task success rate over standard flow-matching baselines in simulation (41 tasks)
Higher success counts on 3 real-robot tasks than DP and FlowPolicy
Effective modeling of both task-level and path-level multi-modality
Retains fast single-step ODE inference with a compact model size
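
A note on reading the headline number: assuming the usual definition of relative improvement, with s_VFP and s_base as illustrative symbols for the VFP and baseline success rates (neither symbol appears in the paper summary), the 49% figure is a ratio and not a gain of 49 percentage points. Whether the ratio is computed per task and then averaged, or on success rates pooled across tasks, is not stated here.

```latex
\[
\Delta_{\mathrm{rel}}
  = \frac{s_{\mathrm{VFP}} - s_{\mathrm{base}}}{s_{\mathrm{base}}},
\qquad
\Delta_{\mathrm{rel}} = 0.49
  \;\Longleftrightarrow\;
  s_{\mathrm{VFP}} = 1.49\, s_{\mathrm{base}} .
\]
```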

Why it matters

Enables reliable, real-time multi-modal robot manipulation for applications requiring diverse, collision-free, or context-dependent behaviors.

Abstract

Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size. More details are available on our project page: https://sites.google.com/view/varfp/
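
For readers unfamiliar with the Kantorovich formulation, the following is a small sketch of distribution-level matching between a set of predicted actions and a set of expert actions. It assumes equal-size, uniformly weighted sets and a squared Euclidean cost; in that special case an optimal Kantorovich plan is a one-to-one assignment, so brute-force search over permutations is exact for tiny inputs. The function names and sample points are hypothetical, and the abstract does not specify how VFP defines its cost or uses the resulting plan in training.

```python
from itertools import permutations


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def exact_ot_uniform(predicted, expert):
    """Exact OT between two equal-size, uniformly weighted point sets.

    Returns (cost, assignment), where assignment[i] is the index of the
    expert point matched to predicted[i]. Brute force: only for small n.
    """
    n = len(predicted)
    if n != len(expert):
        raise ValueError("point sets must have the same size")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(
            squared_distance(predicted[i], expert[j]) for i, j in enumerate(perm)
        ) / n
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


# Illustrative 2-D points, not data from the paper.
predicted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
expert = [(0.0, 1.1), (0.1, 0.0), (1.0, 0.1)]
cost, assignment = exact_ot_uniform(predicted, expert)
print("assignment:", assignment)
print("transport cost:", cost)
```

The cost compares the two sets as wholes: each predicted action is paired with whichever expert action makes the total cost smallest, rather than with the expert action that happens to share its index. For realistic batch sizes a linear-programming or entropic (Sinkhorn) solver would replace the brute-force loop.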

Index terms

Deep Learning in Grasping and Manipulation; Dual Arm Manipulation; Dexterous Manipulation