ICRA 2026

Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

Huy Le, Tai Hoang, Miroslav Gabriel, Gerhard Neumann, Ngo Anh Vien

AI summary

HyDo significantly improves exploration and success rates in non-prehensile manipulation by combining diffusion models for continuous actions with maximum entropy optimization in a hybrid RL framework.
Reinforcement Learning, Diffusion Policies, Hybrid Action Spaces, Non-Prehensile Manipulation, Maximum Entropy, Sim2Real Transfer

Problem

Learning diverse and robust policies for non-prehensile manipulation is challenging due to complex hybrid action spaces (discrete contact points and continuous motion parameters) and limited exploration strategies in existing methods.
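To make the hybrid action space concrete, here is a minimal Python sketch that represents one action as a pair: a discrete contact point and a vector of continuous motion parameters. The class and function names, the indexing of candidate contact points, and the [-1, 1] parameter range are illustrative assumptions, not the paper's action definition. The uniform sampler is only a naive reference point for exploration, not HyDo's strategy.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class HybridAction:
    """Hypothetical hybrid action: which contact point (discrete) and how to move (continuous)."""

    contact_point: int         # index into K candidate contact points (assumed representation)
    motion_params: np.ndarray  # continuous motion parameters, normalized to [-1, 1] (assumed)


def random_hybrid_action(n_contacts: int, dim: int, rng: np.random.Generator) -> HybridAction:
    """Naive uniform sampler over the hybrid space (illustrative, not HyDo's exploration)."""
    return HybridAction(
        contact_point=int(rng.integers(n_contacts)),       # uniform over 0..n_contacts-1
        motion_params=rng.uniform(-1.0, 1.0, size=dim),    # uniform over [-1, 1)
    )
```

For example, `random_hybrid_action(8, 3, np.random.default_rng(0))` returns a contact index in 0..7 and a length-3 parameter vector.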

Approach

The authors propose HyDo, a hybrid off-policy RL algorithm that uses diffusion models to parameterize continuous motion policies and integrates maximum entropy regularization to encourage diverse exploration across both discrete and continuous action spaces.
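The idea of maximum entropy regularization "across both discrete and continuous action spaces" can be written as a generic hybrid objective. The notation here is assumed, not taken from the paper: s is the state, k a discrete contact point, a the continuous motion parameters, π_d and π_c the discrete and continuous policy factors, α a single temperature, and γ the discount. The chain rule of entropy splits the joint entropy into a discrete term and an expected continuous term. For a diffusion-parameterized π_c, the log-density is generally intractable to evaluate exactly; the abstract states that HyDo instead derives its maximum entropy term as a lower bound via structured variational inference, and that bound is not reproduced here.

```latex
\pi(k, a \mid s) = \pi_d(k \mid s)\,\pi_c(a \mid s, k)

J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\big(r(s_t, k_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot,\cdot \mid s_t))\big)\Big]

\mathcal{H}(\pi(\cdot,\cdot \mid s)) = \mathcal{H}(\pi_d(\cdot \mid s)) + \mathbb{E}_{k \sim \pi_d(\cdot \mid s)}\big[\mathcal{H}(\pi_c(\cdot \mid s, k))\big]
```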

Key results

  • Hybrid RL framework combining diffusion policies with maximum entropy optimization
  • Principled objective in which the maximum entropy term is derived as a lower bound via structured variational inference
  • Significantly improved zero-shot sim2real success rate on a real-world 6D pose alignment task (from 53% to 72%)
  • Enhanced behavior diversity and generalization across simulated and real-world tasks
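For scale, the headline real-world result can be read in absolute and relative terms. This is plain arithmetic on the two reported success rates.

```latex
\Delta_{\mathrm{abs}} = 72\% - 53\% = 19 \text{ percentage points}, \qquad
\Delta_{\mathrm{rel}} = \frac{72 - 53}{53} = \frac{19}{53} \approx 0.358 \;(\approx 36\%)
```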

Why it matters

It advances robot dexterity by enabling more robust and generalizable manipulation skills that can transfer effectively from simulation to real-world hardware.

Abstract

Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks, for example, increasing from 53% to 72% on a real-world 6D pose alignment task. The project page is available at https://leh2rng.github.io/hydo
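The abstract's division of labor (discrete contact point chosen by Q-value maximization, continuous motion parameters drawn from a diffusion policy) can be illustrated with a small action-selection sketch. Everything here is an assumption for illustration: `toy_eps_model` and `toy_q_value` are hypothetical stand-ins for learned networks, and the DDPM-style sampler, 20-step linear noise schedule, clipping to [-1, 1], and optional Boltzmann selection over Q-values are choices of this sketch, not HyDo's published implementation.

```python
import numpy as np

# Hypothetical hyperparameters for this sketch only (not HyDo's values).
N_STEPS = 20
BETAS = np.linspace(1e-4, 0.1, N_STEPS)   # linear noise schedule
ALPHAS = 1.0 - BETAS
ALPHA_BARS = np.cumprod(ALPHAS)


def toy_eps_model(x, t, state, k):
    """Stand-in for a learned noise predictor eps(x_t, t | s, k).

    It is the exact predictor for a single target point, so sampling collapses onto it.
    """
    target = np.full_like(x, np.tanh(state.mean() + 0.1 * k))
    return (x - np.sqrt(ALPHA_BARS[t]) * target) / np.sqrt(1.0 - ALPHA_BARS[t])


def sample_motion_params(state, k, dim, rng):
    """DDPM-style ancestral sampling of continuous parameters a ~ pi_c(a | s, k)."""
    x = rng.standard_normal(dim)  # start from x_T ~ N(0, I)
    for t in reversed(range(N_STEPS)):
        eps = toy_eps_model(x, t, state, k)
        mean = (x - BETAS[t] / np.sqrt(1.0 - ALPHA_BARS[t]) * eps) / np.sqrt(ALPHAS[t])
        noise = rng.standard_normal(dim) if t > 0 else 0.0  # no noise on the final step
        x = mean + np.sqrt(BETAS[t]) * noise                # variance choice sigma_t^2 = beta_t
    return np.clip(x, -1.0, 1.0)  # keep parameters in an assumed bounded range


def toy_q_value(state, k, a):
    """Stand-in for a learned critic Q(s, k, a)."""
    return -float(np.sum((a - 0.2 * k) ** 2)) + 0.01 * float(state.sum())


def select_hybrid_action(state, n_contacts, dim, rng, temperature=None):
    """Sample a for each contact point k, then choose k from the Q-values.

    temperature=None -> greedy argmax over Q (Q-value maximization).
    temperature > 0  -> Boltzmann sampling over Q / temperature (an assumed soft variant).
    """
    params = [sample_motion_params(state, k, dim, rng) for k in range(n_contacts)]
    q = np.array([toy_q_value(state, k, a) for k, a in enumerate(params)])
    if temperature is None:
        k = int(np.argmax(q))
    else:
        logits = q / temperature
        probs = np.exp(logits - logits.max())  # subtract max for numerical stability
        probs /= probs.sum()
        k = int(rng.choice(n_contacts, p=probs))
    return k, params[k], q
```

For example, `select_hybrid_action(np.zeros(6), n_contacts=4, dim=3, rng=np.random.default_rng(0))` returns the chosen contact index, its motion parameters, and the Q-value of every candidate. Because the toy noise predictor is exact for a single point, each sample lands on that point; a trained network would instead produce a spread (possibly multimodal) of motion parameters per contact point.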

Index terms

Reinforcement Learning; Machine Learning for Robot Control; Dexterous Manipulation
