ICRA 2026

Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning Via Normalizing Flows

Shaswat Garg, Matin Moezzi, Brandon Da Silva

AI summary

Replacing Gaussian policies with normalizing flows at both hierarchical levels dramatically improves data efficiency and robustness in offline goal-conditioned reinforcement learning.

Normalizing flows, hierarchical RL, goal-conditioned RL, offline reinforcement learning, data efficiency, policy expressivity

Problem

Hierarchical goal-conditioned RL suffers from poor data efficiency and limited policy expressivity, making it difficult to train effectively in offline or data-scarce regimes where collecting diverse datasets is infeasible.

Approach

NF-HIQL replaces unimodal Gaussian policies with RealNVP normalizing flows at both the high and low levels of the hierarchy, enabling exact likelihood computation, multimodal behavior modeling, and efficient single-pass sampling.
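
To make the two properties named above concrete (exact likelihood and single-pass sampling), the following is a minimal sketch of a RealNVP-style conditional flow policy over a 2-D output. It is an illustration under stated assumptions, not the paper's implementation: the class names `AffineCoupling` and `FlowPolicy`, the scalar `context`, and the tiny hand-written scale/shift functions (stand-ins for neural networks) are all hypothetical.

```python
import math
import random


def std_normal_logpdf(z):
    """Log-density of a standard normal, summed over dimensions."""
    return sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)


class AffineCoupling:
    """One conditional affine coupling layer on a 2-D vector.

    The `keep` coordinate passes through unchanged; the other coordinate is
    scaled and shifted by functions of the kept coordinate and the context.
    The scale/shift functions are hypothetical stand-ins for the networks a
    real policy would use.
    """

    def __init__(self, keep, rng):
        self.keep = keep
        self.move = 1 - keep
        # Hypothetical parameters: (weight on kept coord, weight on context, bias).
        self.s_params = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        self.t_params = [rng.uniform(-1.0, 1.0) for _ in range(3)]

    def _scale_shift(self, kept, context):
        ws, vs, bs = self.s_params
        wt, vt, bt = self.t_params
        s = math.tanh(ws * kept + vs * context + bs)  # bounded log-scale
        t = wt * kept + vt * context + bt
        return s, t

    def forward(self, z, context):
        """Latent -> output direction; returns (output, log|det Jacobian|)."""
        s, t = self._scale_shift(z[self.keep], context)
        out = list(z)
        out[self.move] = z[self.move] * math.exp(s) + t
        return out, s

    def inverse(self, a, context):
        """Output -> latent direction; returns (latent, log|det Jacobian|)."""
        s, t = self._scale_shift(a[self.keep], context)
        out = list(a)
        out[self.move] = (a[self.move] - t) * math.exp(-s)
        return out, -s


class FlowPolicy:
    """Stack of coupling layers that alternate which coordinate is transformed."""

    def __init__(self, n_layers=4, seed=0):
        self.rng = random.Random(seed)
        self.layers = [AffineCoupling(i % 2, self.rng) for i in range(n_layers)]

    def sample(self, context):
        """Single forward pass: draw latent noise, push it through the layers."""
        x = [self.rng.gauss(0.0, 1.0), self.rng.gauss(0.0, 1.0)]
        log_prob = std_normal_logpdf(x)
        for layer in self.layers:
            x, log_det = layer.forward(x, context)
            log_prob -= log_det  # change of variables
        return x, log_prob

    def log_prob(self, a, context):
        """Exact log-likelihood by inverting the layers in reverse order."""
        x, total = list(a), 0.0
        for layer in reversed(self.layers):
            x, log_det = layer.inverse(x, context)
            total += log_det
        return std_normal_logpdf(x) + total


# Toy two-level usage: the high level proposes a subgoal, the low level an action.
high, low = FlowPolicy(seed=0), FlowPolicy(seed=1)
subgoal, lp_high = high.sample(context=0.3)        # 0.3: toy (state, goal) feature
action, lp_low = low.sample(context=subgoal[0])    # condition on the subgoal
```

Because each coupling layer has a triangular Jacobian, the log-determinant is just the scalar log-scale `s`, so the density returned by `sample` and the one recomputed by `log_prob` agree up to floating-point error. A Gaussian policy also gives exact likelihoods, but only for a single mode; stacking coupling layers is what lets the same machinery represent multimodal distributions.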

Key results

Outperforms diffusion and Gaussian baselines on OGBench navigation and manipulation tasks
Maintains a 73% success rate with 50% of the data, while baselines collapse to 1-30%
Provides theoretical KL-divergence bounds and PAC-style sample efficiency guarantees
Achieves 100% task success on a real 6-DOF robot arm with limited offline samples

Why it matters

Enables scalable, data-efficient hierarchical decision-making for robotics and long-horizon control tasks where data collection is costly or unsafe.

Abstract

Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. This work introduces normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.

Index terms

Autonomous Agents; Learning from Demonstration; Reinforcement Learning