ICRA 2026

From Dream to Action: Hierarchical Policy Learning with 3D World Imagination for Robotic Manipulation

Wenshuo Wang, Ruiteng Zhao, Tat Joo Teo, Marcelo H Ang Jr, Haiyue Zhu

AI summary

Coupling a triplane-based 3D world imagination module with a flow-based action policy significantly improves future state prediction accuracy and manipulation success rates over state-of-the-art baselines.

3D World Imagination; Hierarchical Policy Learning; Flow-Based Control; Triplane Representation; Robotic Manipulation; Visuomotor Policies

Problem

Most visuomotor policies rely on 2D observations or directly map inputs to actions without anticipating interactive dynamics, limiting spatial reasoning and step-by-step control in complex tasks.

Approach

The authors propose a hierarchical framework that decouples 3D scene prediction from decision-making, using a triplane autoencoder to forecast intermediate future point clouds and a flow-based policy with adaptive guidance to generate real-time motor commands.
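To make the triplane idea concrete, here is a minimal sketch of how a triplane representation is commonly queried; it is not the authors' implementation. A triplane stores three axis-aligned 2D feature grids (XY, XZ, YZ), and a 3D point is featurized by projecting it onto each plane, interpolating bilinearly, and combining the three results. The function names, the unit-cube coordinate convention, and the sum aggregation are assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly sample a (R, R, C) feature plane at (N, 2) coords in [0, 1]."""
    res = plane.shape[0]
    # Map [0, 1] to continuous grid coordinates in [0, res - 1].
    xy = np.clip(uv, 0.0, 1.0) * (res - 1)
    x0 = np.floor(xy[:, 0]).astype(int)
    y0 = np.floor(xy[:, 1]).astype(int)
    x1 = np.minimum(x0 + 1, res - 1)
    y1 = np.minimum(y0 + 1, res - 1)
    wx = (xy[:, 0] - x0)[:, None]
    wy = (xy[:, 1] - y0)[:, None]
    # Interpolate along the first axis, then along the second.
    near = plane[x0, y0] * (1 - wx) + plane[x1, y0] * wx
    far = plane[x0, y1] * (1 - wx) + plane[x1, y1] * wx
    return near * (1 - wy) + far * wy

def query_triplane(planes, points):
    """planes: dict with 'xy', 'xz', 'yz' arrays of shape (R, R, C).
    points: (N, 3) array inside the unit cube. Returns (N, C) features."""
    f_xy = bilinear_sample(planes["xy"], points[:, [0, 1]])
    f_xz = bilinear_sample(planes["xz"], points[:, [0, 2]])
    f_yz = bilinear_sample(planes["yz"], points[:, [1, 2]])
    return f_xy + f_xz + f_yz  # sum aggregation is an assumption
```

The appeal of this layout is memory: three R x R planes grow quadratically with resolution, whereas a dense voxel grid grows cubically, which is consistent with the summary's emphasis on efficiency.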

Key results

92% voxel IoU in future 3D state prediction
Up to 8% higher success rates than state-of-the-art baselines on Adroit and Meta-World benchmarks
Successful execution of long-horizon and contact-rich tasks in real-world evaluations
Plug-and-play triplane world model with adaptive Classifier-Free Guidance for efficient action generation
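For readers unfamiliar with the voxel IoU metric cited in the results above, the sketch below shows one common way to compute it between a predicted and a ground-truth point cloud: both clouds are discretized into occupied voxels, and the score is the size of the intersection divided by the size of the union. The voxel size, the workspace bounds, and the convention of returning 1.0 when both clouds are empty are assumptions; the listing does not state the settings used in the paper.

```python
import numpy as np

def voxelize(points, voxel_size, lo, hi):
    """Return the set of occupied voxel indices for an (N, 3) point cloud."""
    points = np.asarray(points, dtype=float).reshape(-1, 3)
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    # Discard points outside the (assumed) workspace bounds.
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = np.floor((points[inside] - lo) / voxel_size).astype(int)
    return set(map(tuple, idx.tolist()))

def voxel_iou(pred_points, gt_points, voxel_size=0.05,
              lo=(0.0, 0.0, 0.0), hi=(1.0, 1.0, 1.0)):
    """Intersection over union of the occupied voxel sets of two clouds."""
    pred = voxelize(pred_points, voxel_size, lo, hi)
    gt = voxelize(gt_points, voxel_size, lo, hi)
    union = pred | gt
    if not union:
        return 1.0  # both empty: treated as a perfect match (assumed convention)
    return len(pred & gt) / len(union)
```

The score depends on the chosen voxel size: coarser voxels forgive small geometric errors and raise the IoU, so figures are only comparable at the same resolution.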

Why it matters

It establishes a scalable, spatially-aware paradigm for robotic manipulation that bridges high-level world reasoning with low-level real-time control, advancing embodied AI and automation.

Abstract

Recent advancements in robotics have focused on developing foundation models capable of generating both actions and future states. Typically, these policies leverage world models to depict human-like imagination. However, most methods remain confined to the 2D domain, where they forecast only the final outcome state rather than the evolving interaction process, thereby offering limited guidance for step-by-step control. To address these limitations, we propose a hierarchical framework that couples 3D imagination, 3D perception, and action generation. A triplane-based world model captures future scene dynamics in a computationally efficient manner, providing predictive cues for decision-making. Based on these representations, the action expert, implemented with a flow-based policy network, converts the outputs of 3D imagination and perception into executable commands. We further introduce an adaptive Classifier-Free Guidance strategy to balance action quality with condition adherence. On Adroit, Meta-World, and real-world tasks, our method achieves a 92% voxel IoU in future state prediction and up to 8% higher success rates than state-of-the-art baselines. The performance gains highlight the effectiveness and generalizability of our method in complex robotic manipulation.
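The combination of a flow-based policy and Classifier-Free Guidance described in the abstract can be illustrated with a short sketch. It shows only the standard guidance blend applied during Euler integration of a velocity field from noise to an action; the abstract does not specify how the adaptive strategy chooses the guidance weight, so the weight is passed in as a caller-supplied function `weight_fn`. The names `velocity_fn`, `weight_fn`, and `sample_action` are hypothetical, and `velocity_fn` stands in for a trained network.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, w):
    """Standard CFG blend: w = 1 gives the conditional field, w = 0 the
    unconditional one, and w > 1 pushes further toward the condition."""
    return v_uncond + w * (v_cond - v_uncond)

def sample_action(velocity_fn, cond, action_dim, weight_fn, steps=10, seed=0):
    """Euler-integrate a guided flow from noise (t = 0) to an action (t = 1).

    velocity_fn(a, t, cond) is a hypothetical stand-in for the policy network;
    passing cond=None requests the unconditional prediction.
    weight_fn(t) returns the guidance weight at time t (placeholder for
    whatever adaptive rule is used; not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(action_dim)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v_cond = velocity_fn(a, t, cond)
        v_uncond = velocity_fn(a, t, None)
        a = a + dt * guided_velocity(v_cond, v_uncond, weight_fn(t))
    return a
```

A constant `weight_fn = lambda t: 1.0` reduces this to plain conditional sampling; any time-varying or state-dependent rule can be swapped in without changing the integrator, which is what makes the guidance weight a natural place to trade action quality against condition adherence.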

Index terms

Imitation Learning; Learning from Demonstration; Deep Learning in Grasping and Manipulation