ICRA 2026

Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

Huy Le, Tai Hoang, Miroslav Gabriel, Gerhard Neumann, Ngo Anh Vien

AI summary

HyDo significantly improves exploration and success rates in non-prehensile manipulation by combining diffusion models for continuous actions with maximum entropy optimization in a hybrid RL framework.

Reinforcement Learning, Diffusion Policies, Hybrid Action Spaces, Non-Prehensile Manipulation, Maximum Entropy, Sim2Real Transfer

Problem

Learning diverse and robust policies for non-prehensile manipulation is challenging due to complex hybrid action spaces (discrete contact points and continuous motion parameters) and limited exploration strategies in existing methods.

Approach

The authors propose HyDo, a hybrid off-policy RL algorithm that uses diffusion models to parameterize continuous motion policies and integrates maximum entropy regularization to encourage diverse exploration across both discrete and continuous action spaces.
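The hybrid action selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the discrete contact point is drawn from a Boltzmann (softmax) distribution over Q-values and that the continuous motion parameters come from a DDPM-style reverse process. The names `sample_contact_point`, `sample_motion`, `hybrid_action`, and `toy_noise_model` are hypothetical, and `toy_noise_model` stands in for a learned network by assuming a known clean action per contact point.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a flatter distribution."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_contact_point(q_values, temperature, rng):
    """Assumption: sample the discrete action from a Boltzmann distribution over Q-values."""
    probs = softmax(q_values, temperature)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


def toy_noise_model(x_t, alpha_bar_t, target):
    """Hypothetical stand-in for a learned noise predictor.

    Returns the noise that would be exact if the clean action were `target`.
    """
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - math.sqrt(alpha_bar_t) * m) / scale for x, m in zip(x_t, target)]


def sample_motion(target, betas, rng):
    """DDPM-style reverse process: start from Gaussian noise and denoise step by step."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in reversed(range(len(betas))):
        eps = toy_noise_model(x, alpha_bars[t], target)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the final step
        x = [mi + sigma * rng.gauss(0.0, 1.0) for mi in mean]
    return x


def hybrid_action(q_values, targets, betas, temperature=1.0, seed=0):
    """Pick a contact point, then sample motion parameters conditioned on it."""
    rng = random.Random(seed)
    index, probs = sample_contact_point(q_values, temperature, rng)
    motion = sample_motion(targets[index], betas, rng)
    return index, motion, entropy(probs)


# Illustrative values only: three candidate contact points, 2-D motion parameters.
q_values = [0.2, 1.0, 0.5]
targets = [[0.1, 0.0], [-0.3, 0.4], [0.0, -0.2]]
betas = [0.02 * (i + 1) for i in range(10)]
index, motion, policy_entropy = hybrid_action(q_values, targets, betas)
```

Because the toy noise model assumes a single known clean action, the reverse process collapses onto that target. A learned noise predictor could instead represent several distinct motions for the same contact point, which is where a diffusion policy would contribute behavioral diversity. Raising `temperature` flattens the contact-point distribution and increases its entropy.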

Key results

Hybrid RL framework combining diffusion policies with maximum entropy optimization
Theoretical justification via structured variational inference for the lower-bound objective
Significantly improved zero-shot sim2real success rates (from 53% to 72% on a real-world 6D pose alignment task)
Enhanced behavior diversity and generalization across simulated and real-world tasks
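
For orientation, a generic maximum entropy objective over a hybrid action can be written as follows. This is a standard textbook form, not the paper's derivation: the factorization of the policy into a discrete part $\pi^d$ and a continuous part $\pi^c$, the temperature $\alpha$, and the discount $\gamma$ are assumptions made here for illustration.

```latex
% Assumed factorization of the hybrid policy (the paper's ordering may differ):
\pi(a^d, a^c \mid s) = \pi^d(a^d \mid s)\, \pi^c(a^c \mid s, a^d)

% Generic maximum entropy RL objective with temperature \alpha:
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} \Big( r(s_t, a_t)
         + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]

% Chain rule of entropy for the factorized policy:
\mathcal{H}\big(\pi(\cdot \mid s)\big)
  = \mathcal{H}\big(\pi^d(\cdot \mid s)\big)
  + \mathbb{E}_{a^d \sim \pi^d(\cdot \mid s)}
    \Big[ \mathcal{H}\big(\pi^c(\cdot \mid s, a^d)\big) \Big]
```

The entropy of a diffusion-parameterized $\pi^c$ generally has no closed form, which is consistent with the source's statement that the entropy term is handled through a lower bound obtained by structured variational inference. That bound is not reproduced here.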

Why it matters

It advances robot dexterity by enabling more robust and generalizable manipulation skills that can transfer effectively from simulation to real-world hardware.

Abstract

Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks - for example, increasing from 53% to 72% on a real-world 6D pose alignment task. Project page is available at https://leh2rng.github.io/hydo

Index terms

Reinforcement Learning, Machine Learning for Robot Control, Dexterous Manipulation