Offline Reinforced Finetuning for Chunk-Based VLA Via Real-World RL Policy Distillation with Vision-Guided Copilot
AI summary
Problem
Pre-trained vision-language-action models struggle to maintain performance when adapted to new tasks, while supervised fine-tuning demands costly demonstrations and reinforcement learning faces sparse rewards, sim-to-real gaps, and heavy human intervention requirements.
Approach
The method integrates a diffusion-based Vision-guided Copilot to refine human teleoperation into expert actions during real-world RL, then distills the resulting reward-annotated trajectories into a chunk-based VLA using an offline RL algorithm.
Key results
- Vision-guided Copilot refines human teleoperation into expert actions via visual feedback
- CopRL enables sample-efficient real-world RL with minimal human intervention
- RaoRFT distills reward-annotated trajectories into chunk-based VLAs via offline RL
- Achieves state-of-the-art real-world manipulation performance with minimal human effort
Why it matters
Provides a practical, scalable pathway for deploying high-performance, adaptable vision-language-action models in complex real-world robotic manipulation tasks.
Abstract
Pre-trained vision-language-action (VLA) models do not fully retain their strong performance when fine-tuned for new tasks, which hinders robust deployment in new environments. This limitation primarily stems from the constraints of prevalent fine-tuning approaches: supervised fine-tuning (SFT) requires large amounts of high-quality demonstrations, while reinforcement learning (RL) is often limited by dataset quality, sparse rewards, and the sim-to-real gap. To overcome these limitations, we propose a novel framework that leverages sample-efficient real-world RL to collect data for offline distillation into the VLA. Our method introduces three key components: (1) a Vision-guided Copilot that refines human actions toward expert-level actions using visual feedback, improving intervention quality and data efficiency; (2) CopRL, a human-in-the-loop RL framework that leverages the Copilot for efficient online exploration and data collection with minimal human intervention; and (3) RaoRFT, an offline RL algorithm that distills high-quality reward-annotated trajectories from CopRL into the VLA. Real-world experiments show that our method achieves state-of-the-art performance with minimal human input. Our work provides a practical and effective pathway for deploying high-performance VLAs in complex manipulation tasks. Code and models will be made available.
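To make the phrase "reward-annotated trajectories" for a chunk-based policy more concrete, the following is a minimal sketch of how a logged trajectory with sparse rewards could be turned into action-chunk training samples. It is not the paper's RaoRFT algorithm, whose details the abstract does not give. The `Step` schema, the discount factor `gamma`, the chunk length, and the rule of padding with the last action are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Step:
    """One timestep of a logged trajectory (hypothetical schema)."""
    action: Sequence[float]  # robot action executed at this step
    reward: float            # sparse reward, e.g. 1.0 only on task success


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def to_chunks(traj: Sequence[Step], chunk_len: int, gamma: float = 0.99) -> List[Dict]:
    """Slice a trajectory into overlapping action chunks.

    Each sample holds the next `chunk_len` actions starting at step t
    (the final action is repeated as padding near the end, an assumed
    convention) together with the return-to-go at step t.
    """
    returns = discounted_returns([s.reward for s in traj], gamma)
    last = len(traj) - 1
    samples = []
    for t in range(len(traj)):
        window = [list(traj[min(t + k, last)].action) for k in range(chunk_len)]
        samples.append({"t": t, "actions": window, "return": returns[t]})
    return samples


# Toy trajectory: three steps, success reward only at the end.
demo = [Step([0.0], 0.0), Step([1.0], 0.0), Step([2.0], 1.0)]
chunks = to_chunks(demo, chunk_len=2, gamma=0.5)
```

In this toy example, earlier chunks receive smaller returns than chunks close to the success signal. An offline RL objective could use such a signal to weight or rank chunks, though how RaoRFT actually uses rewards is not specified in the abstract.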