SII 2026

Rewarding Change Beyond State: Directional VLM Rewards for Sample-Efficient Robot Reinforcement Learning

Linus, Christoffer Lundgren, Wenhao Lu, Zhitao Liang, Ze Zhang, Karinne Ramirez-Amaro, Emmanuel Dean

Abstract

Sparse rewards are a persistent bottleneck for robotic manipulation with Reinforcement Learning (RL), primarily because RL agents must discover long-horizon, multi-step behaviors while receiving infrequent and weakly informative feedback. Recent work uses pre-trained Vision Language Models (VLMs) to provide dense per-step rewards, yet most approaches score only a single image against a goal text, ignoring whether the recent change actually moves the system toward success. We argue that this omission impairs exploration (e.g., goal-like detours, wrong-way progress, action aliasing) and propose to make time explicit in VLM rewards by adding a directional signal that evaluates short-horizon change. Concretely, we pair visual change over a few steps with a text description of the desired change, and finetune lightweight heads with RL; the resulting directional signal is combined with a standard positional signal into a single shaping reward. We evaluated our approach in six MetaWorld manipulation tasks with fixed goals. This directional shaping improves running average success at a fixed budget to 78.2%, versus 63.8% for the best-tuned positional baseline (improvements were observed in five of six tasks). Ablations identify key design choices for the proposed directional term to be effective and show its synergy with the positional term when supplying dense VLM rewards, demonstrating improved exploration and sample efficiency.
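
As a rough illustration of the idea described above, the following minimal sketch combines a positional term (current frame versus goal text) with a directional term (change between two frames versus a text describing the desired change). It is not the authors' implementation: the abstract does not specify the exact formulation, so the cosine-similarity scoring, the embedding difference used to represent visual change, the weighted sum, and all function and parameter names (`positional_reward`, `directional_reward`, `shaping_reward`, `w_pos`, `w_dir`) are assumptions made for illustration. Embeddings are passed in as plain vectors; in practice they would come from a VLM's image and text encoders, and the paper additionally fine-tunes lightweight heads, which this sketch omits.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def positional_reward(frame_emb: Vector, goal_text_emb: Vector) -> float:
    """Assumed positional term: how goal-like the current frame looks."""
    return cosine(frame_emb, goal_text_emb)


def directional_reward(
    prev_frame_emb: Vector, frame_emb: Vector, change_text_emb: Vector
) -> float:
    """Assumed directional term: does the recent visual change match
    the text description of the desired change?"""
    delta = [cur - prev for prev, cur in zip(prev_frame_emb, frame_emb)]
    return cosine(delta, change_text_emb)


def shaping_reward(
    prev_frame_emb: Vector,
    frame_emb: Vector,
    goal_text_emb: Vector,
    change_text_emb: Vector,
    w_pos: float = 0.5,  # hypothetical weight, not from the paper
    w_dir: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Assumed combination: a weighted sum of both terms."""
    pos = positional_reward(frame_emb, goal_text_emb)
    direction = directional_reward(prev_frame_emb, frame_emb, change_text_emb)
    return w_pos * pos + w_dir * direction


# Toy 2-D example: both text embeddings point along the first axis.
goal_text = [1.0, 0.0]
change_text = [1.0, 0.0]
previous = [0.0, 1.0]
moved_toward = [1.0, 1.0]   # change is aligned with the desired direction
moved_away = [-1.0, 1.0]    # change is opposite to the desired direction

r_toward = shaping_reward(previous, moved_toward, goal_text, change_text)
r_away = shaping_reward(previous, moved_away, goal_text, change_text)
```

In this toy setup, a step whose change aligns with the desired direction receives a higher shaping reward than a step that moves the opposite way, which is the kind of wrong-way progress a single-image score does not penalize directly.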

Index terms

Machine Learning, Robotics