ICRA 2026

Offline Reinforced Finetuning for Chunk-Based VLA Via Real-World RL Policy Distillation with Vision-Guided Copilot

AI summary

A vision-guided copilot combined with offline RL distillation enables efficient, high-performance fine-tuning of vision-language-action models in real-world manipulation with minimal human effort.

Vision-Language-Action models; Real-world reinforcement learning; Offline RL fine-tuning; Human-in-the-loop teleoperation; Policy distillation; Robotic manipulation

Problem

Pre-trained vision-language-action models struggle to maintain performance when adapted to new tasks, while supervised fine-tuning demands costly demonstrations and reinforcement learning faces sparse rewards, sim-to-real gaps, and heavy human intervention requirements.

Approach

The method integrates a diffusion-based Vision-guided Copilot to refine human teleoperation into expert actions during real-world RL, then distills the resulting reward-annotated trajectories into a chunk-based VLA using an offline RL algorithm.
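As a rough illustration of the data-collection side of this pipeline, the sketch below logs reward-annotated transitions while switching between an autonomous policy and a copilot-refined human action. It is not the authors' implementation: every name (`collect_episode`, `toy_copilot`, and so on) is hypothetical, the copilot is a simple clipping rule standing in for the diffusion-based model, and the environment is a 1-D toy with a sparse reward.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transition:
    obs: float
    action: float
    reward: float
    intervened: bool  # True when the action came from the human + copilot path


def collect_episode(
    policy: Callable[[float], float],
    human: Callable[[float], float],
    copilot: Callable[[float, float], float],
    should_intervene: Callable[[float], bool],
    step_env: Callable[[float, float], Tuple[float, float, bool]],
    obs: float,
    horizon: int,
) -> List[Transition]:
    """Roll out one episode and return its reward-annotated transitions."""
    trajectory: List[Transition] = []
    for _ in range(horizon):
        intervened = should_intervene(obs)
        if intervened:
            # Assumption: the copilot maps (observation, raw human action)
            # to a refined action. The real Copilot is diffusion-based.
            action = copilot(obs, human(obs))
        else:
            action = policy(obs)
        next_obs, reward, done = step_env(obs, action)
        trajectory.append(Transition(obs, action, reward, intervened))
        obs = next_obs
        if done:
            break
    return trajectory


# --- Hypothetical 1-D toy stand-ins (goal at position 0) ---
def toy_step(obs: float, action: float) -> Tuple[float, float, bool]:
    next_obs = obs + action
    done = abs(next_obs) < 0.03
    return next_obs, (1.0 if done else 0.0), done  # sparse success reward


def toy_policy(obs: float) -> float:
    return -0.5 * obs  # halves the distance to the goal each step


def toy_human(obs: float) -> float:
    return -1.5 * obs  # coarse teleoperation that would overshoot the goal


def toy_copilot(obs: float, human_action: float) -> float:
    limit = 0.6 * abs(obs)  # clip the human command to a conservative step
    return max(-limit, min(limit, human_action))


def toy_should_intervene(obs: float) -> bool:
    return abs(obs) > 0.5  # the human steps in only when far from the goal


episode = collect_episode(
    toy_policy, toy_human, toy_copilot, toy_should_intervene,
    toy_step, obs=1.0, horizon=20,
)
print(len(episode), [t.intervened for t in episode])
```

Keeping the `intervened` flag alongside each reward is a design choice of this sketch; it lets a later offline stage distinguish human-guided steps from autonomous ones if that turns out to be useful.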

Key results

Vision-guided Copilot refines human teleoperation into expert actions via visual feedback
CopRL enables sample-efficient real-world RL with minimal human intervention
RaoRFT distills reward-annotated trajectories into chunk-based VLAs via offline RL
Achieves state-of-the-art real-world manipulation performance with minimal human effort
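To make the idea of offline RL over action chunks more concrete, here is a minimal sketch of one common preprocessing step: grouping a per-step, reward-annotated trajectory into chunk-level transitions with a discounted chunk reward and a bootstrap discount. This is a generic multi-step formulation, not the RaoRFT algorithm itself; the function name, the dictionary fields, and the assumption that each trajectory ends in a terminal state are all illustrative.

```python
from typing import Any, Dict, List, Sequence


def to_chunk_transitions(
    observations: Sequence[Any],
    actions: Sequence[Any],
    rewards: Sequence[float],
    chunk_len: int,
    gamma: float,
) -> List[Dict[str, Any]]:
    """Group per-step data into chunk-level transitions.

    `observations` must hold one more entry than `actions`, because it
    includes the observation reached after the final action.
    """
    num_steps = len(actions)
    if len(observations) != num_steps + 1 or len(rewards) != num_steps:
        raise ValueError("expected len(observations) == len(actions) + 1 == len(rewards) + 1")
    if chunk_len < 1:
        raise ValueError("chunk_len must be at least 1")

    chunks: List[Dict[str, Any]] = []
    for start in range(0, num_steps, chunk_len):
        end = min(start + chunk_len, num_steps)  # the last chunk may be shorter
        # Discounted sum of the per-step rewards inside this chunk.
        chunk_reward = sum(
            (gamma ** k) * rewards[start + k] for k in range(end - start)
        )
        chunks.append({
            "obs": observations[start],
            "action_chunk": list(actions[start:end]),
            "reward": chunk_reward,
            "next_obs": observations[end],
            # Discount applied to the bootstrapped value at next_obs.
            "discount": gamma ** (end - start),
            # Assumption: the trajectory ends in a terminal state.
            "done": end == num_steps,
        })
    return chunks


# Five steps with a sparse success reward on the final step.
example = to_chunk_transitions(
    observations=["o0", "o1", "o2", "o3", "o4", "o5"],
    actions=["a0", "a1", "a2", "a3", "a4"],
    rewards=[0.0, 0.0, 0.0, 0.0, 1.0],
    chunk_len=2,
    gamma=0.5,
)
for chunk in example:
    print(chunk)
```

Under this formulation, a chunk-level temporal-difference target would take the form `reward + discount * V(next_obs)` for non-terminal chunks, so a critic can be trained at the same granularity at which the chunk-based policy acts.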

Why it matters

Provides a practical, scalable pathway for deploying high-performance, adaptable vision-language-action models in complex real-world robotic manipulation tasks.

Abstract

Pre-trained VLAs do not fully retain their strong performance when fine-tuned for new tasks, hindering robust deployment in new environments. This limitation primarily stems from the constraints of prevalent fine-tuning approaches: supervised fine-tuning (SFT) requires large amounts of high-quality demonstrations, while reinforcement learning (RL) is often limited by dataset quality, sparse rewards, and the sim-to-real gap. To overcome these limitations, we propose a novel framework that leverages sample-efficient real-world RL to collect data for offline distillation into the VLA. Our method introduces three key components: (1) a Vision-guided Copilot that refines human actions toward expert-level actions using visual feedback, improving intervention quality and data efficiency; (2) CopRL, a human-in-the-loop RL framework that leverages the Copilot for efficient online exploration and data collection with minimal human intervention; and (3) RaoRFT, an offline RL algorithm that distills high-quality reward-annotated trajectories from CopRL into the VLA. Real-world experiments show that our method achieves state-of-the-art performance with minimal human input. Our work provides a practical and effective pathway for deploying high-performance VLAs in complex manipulation tasks. Code and models will be made available.

Index terms

Reinforcement Learning