ICRA 2026

Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models

Mingyang Lv, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, Yi Zeng

AI summary

FPO enables stable online reinforcement fine-tuning of flow-matching VLA models by bypassing intractable likelihood calculations, significantly boosting robotic task success rates beyond imitation learning limits.

Flow-matching, Vision-Language-Action, Reinforcement Learning, Policy Optimization, Robotic Control, Online Fine-tuning

Problem

Conventional policy-gradient RL methods are computationally infeasible for flow-matching Vision-Language-Action models: computing importance sampling ratios requires the policy's exact likelihood, which for a flow model means integrating an ordinary differential equation together with a Jacobian-trace term. This prevents effective online fine-tuning, so performance stays capped by the supervised imitation data.
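
To make the cost concrete, here is the standard instantaneous change-of-variables identity for a continuous flow, written as a sketch. The notation is an assumption and is not taken from the paper: v_θ is the learned velocity field, o the observation, x_0 the initial noise sample, and a = x_1 the generated action.

```latex
% Assumed notation: \dot{x}_t = v_\theta(x_t, t \mid o), \; x_0 \sim p_0, \; a = x_1
\log \pi_\theta(a \mid o)
  = \log p_0(x_0)
  - \int_0^1 \operatorname{tr}\!\left(
      \frac{\partial v_\theta(x_t, t \mid o)}{\partial x_t}
    \right) \mathrm{d}t ,
\qquad
r_\theta(a \mid o) = \frac{\pi_\theta(a \mid o)}{\pi_{\theta_{\text{old}}}(a \mid o)} .
```

Evaluating the ratio r_θ exactly would require this integral for both the current and the old policy on every sampled action, which is the expense a likelihood-free method avoids.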

Approach

FPO replaces intractable likelihood ratios with a likelihood-free proxy derived from per-sample changes in the conditional flow-matching objective, integrated with structure-aware credit assignment, clipped surrogate objectives, multi-step latent exploration, and a Q-ensemble for stable online updates.
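
The summary does not give the exact formula, so the following is only a minimal sketch of one plausible reading: the ratio proxy is the exponential of the per-sample drop in conditional flow-matching loss, fed into a PPO-style clipped surrogate. The function names, the linear noise-to-action path, and the clipping width `eps` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cfm_loss_per_sample(velocity_fn, actions, noise, t):
    """Per-sample conditional flow-matching loss on a straight noise-to-action path.

    Assumed convention (not from the paper): x_t = (1 - t) * noise + t * action,
    with target velocity action - noise.
    velocity_fn(x_t, t) returns a predicted velocity with the same shape as x_t.
    actions, noise: arrays of shape (B, D); t: array of shape (B,) in [0, 1].
    """
    t_col = t[:, None]
    x_t = (1.0 - t_col) * noise + t_col * actions
    target = actions - noise
    pred = velocity_fn(x_t, t)
    return np.mean((pred - target) ** 2, axis=1)  # shape (B,)

def proxy_ratio(loss_old, loss_new):
    """Likelihood-free stand-in for the importance ratio (assumed form).

    Exceeds 1 when the updated policy fits the sample better than the old one.
    """
    return np.exp(loss_old - loss_new)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective to maximize, averaged over the batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))
```

In use, `loss_old` would be computed once with the behavior policy's parameters on the same `(noise, t)` draws and held fixed, while `loss_new` is recomputed with the parameters being optimized, so that the ratio only reflects the parameter change.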

Key results

Achieves an 87.2% average success rate on the LIBERO benchmark
Surpasses supervised, diffusion-based, and autoregressive RL baselines
Enables stable online learning under sparse rewards and contact-rich dynamics
Validates individual component contributions via ablation studies

Why it matters

It unlocks practical online reinforcement learning for flow-matching VLA models, enabling robots to continuously improve and generalize beyond static demonstration data without prohibitive computational costs.

Abstract

Vision-Language-Action (VLA) models such as OpenVLA, Octo, and π0 have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible for flow-matching-based models because importance sampling requires explicit computation of policy ratios, which is intractable. To overcome this limitation, we propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the π0 model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and π0-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives, with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics highlight the contributions of the individual components of FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.
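
The abstract does not say how the Q-ensemble is aggregated. As a hedged illustration only, a common choice in ensemble critics is a pessimistic estimate (mean minus a multiple of the standard deviation across members); the function name and the `kappa` parameter below are hypothetical and not taken from the paper.

```python
import numpy as np

def ensemble_value(q_values, kappa=1.0):
    """Pessimistic value estimate from K critics (assumed aggregation rule).

    q_values: array-like of shape (K, B), one row of predictions per critic.
    Returns an array of shape (B,): ensemble mean minus kappa times the
    ensemble standard deviation, so disagreement lowers the estimate.
    """
    q = np.asarray(q_values, dtype=float)
    return q.mean(axis=0) - kappa * q.std(axis=0)
```

Penalizing disagreement is one way an ensemble can dampen overestimated values under sparse rewards; taking the minimum across members is another common variant.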

Index terms

Learning from Demonstration, Imitation Learning, Reinforcement Learning