Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation
Huy Le, Tai Hoang, Miroslav Gabriel, Gerhard Neumann, Ngo Anh Vien
AI summary
Problem
Learning diverse and robust policies for non-prehensile manipulation is challenging due to complex hybrid action spaces (discrete contact points and continuous motion parameters) and limited exploration strategies in existing methods.
Approach
The authors propose HyDo, a hybrid off-policy RL algorithm that uses diffusion models to parameterize continuous motion policies and integrates maximum entropy regularization to encourage diverse exploration across both discrete and continuous action spaces.
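As a rough illustration of what a maximum entropy objective over a hybrid action space can look like, the block below writes out the standard entropy-regularized return together with the chain rule for the entropy of a factorized policy. The factorization into a discrete part \pi^{d} and a continuous part \pi^{c}, and the symbols a^{d}, a^{c}, \alpha and \gamma, are assumptions made here for illustration and are not taken from the paper. The exact objective and its variational lower bound are given in the paper itself.

```latex
% Assumed factorization of the hybrid policy (illustrative notation)
\pi(a^{d}, a^{c} \mid s) = \pi^{d}(a^{d} \mid s)\, \pi^{c}(a^{c} \mid s, a^{d})

% Standard maximum entropy RL objective with temperature \alpha
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} \Big( r(s_t, a^{d}_t, a^{c}_t)
         + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]

% Chain rule: the joint entropy splits into a discrete and a continuous term
\mathcal{H}\big(\pi(\cdot \mid s)\big)
  = \mathcal{H}\big(\pi^{d}(\cdot \mid s)\big)
  + \mathbb{E}_{a^{d} \sim \pi^{d}}\!\left[ \mathcal{H}\big(\pi^{c}(\cdot \mid s, a^{d})\big) \right]
```

The entropy of a diffusion-parameterized \pi^{c} generally has no closed form, which is consistent with the summary's statement that the entropy term is handled through a lower bound.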
Key results
- Hybrid RL framework combining diffusion policies with maximum entropy optimization
- Theoretical justification via structured variational inference for the lower-bound objective
- Significantly improved zero-shot sim2real success rates (from 53% to 72% on a real-world 6D pose alignment task)
- Enhanced behavior diversity and generalization across simulated and real-world tasks
Why it matters
It advances robot dexterity by enabling more robust and generalizable manipulation skills that can transfer effectively from simulation to real-world hardware.
Abstract
Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks, for example increasing from 53% to 72% on a real-world 6D pose alignment task. Project page is available at https://leh2rng.github.io/hydo
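The hybrid action selection described in the abstract can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the HyDo implementation. The names `sample_motion`, `toy_q` and `select_hybrid_action` are hypothetical. The hand-written noise predictor stands in for a learned, state-conditioned denoising network. Sampling the contact index from a softmax over Q-values is one common entropy-regularized choice, assumed here for illustration.

```python
import math
import random


def softmax(values, temperature):
    """Boltzmann distribution over values; a higher temperature gives higher entropy."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_motion(target, betas, rng):
    """Toy DDPM-style reverse process for one scalar motion parameter.

    Assumption: the noise predictor below is hand-written and exact only for
    data concentrated at `target`; a real diffusion policy would use a learned
    network that can represent multimodal action distributions.
    """
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = rng.gauss(0.0, 1.0)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        scale = math.sqrt(1.0 - alpha_bars[t])
        eps_hat = (x - math.sqrt(alpha_bars[t]) * target) / scale
        mean = (x - betas[t] / scale * eps_hat) / math.sqrt(alphas[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the last step
        x = mean + math.sqrt(betas[t]) * noise
    return x


def toy_q(contact, motion):
    """Hypothetical critic: prefers contact index 2 and motions near 0.5."""
    return -abs(contact - 2) - (motion - 0.5) ** 2


def select_hybrid_action(num_contacts, temperature, rng):
    """Sample one motion per contact candidate, then pick a contact by softmax(Q)."""
    betas = [0.05 + 0.02 * i for i in range(10)]  # illustrative noise schedule
    motions = [sample_motion(0.1 * k, betas, rng) for k in range(num_contacts)]
    q_values = [toy_q(k, m) for k, m in enumerate(motions)]
    probs = softmax(q_values, temperature)
    contact = rng.choices(range(num_contacts), weights=probs, k=1)[0]
    return contact, motions[contact], probs


if __name__ == "__main__":
    rng = random.Random(0)
    contact, motion, probs = select_hybrid_action(4, 1.0, rng)
    print(contact, round(motion, 3), [round(p, 3) for p in probs])
```

Lowering `temperature` concentrates the contact distribution on the highest-Q candidate, which recovers plain Q-value maximization in the limit, while raising it spreads probability across candidates and yields more diverse contact choices.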