From Dream to Action: Hierarchical Policy Learning with 3D World Imagination for Robotic Manipulation
Wenshuo Wang, Ruiteng Zhao, Tat Joo Teo, Marcelo H Ang Jr, Haiyue Zhu
AI summary
Problem
Most visuomotor policies rely on 2D observations or directly map inputs to actions without anticipating interactive dynamics, limiting spatial reasoning and step-by-step control in complex tasks.
Approach
The authors propose a hierarchical framework that decouples 3D scene prediction from decision-making, using a triplane autoencoder to forecast intermediate future point clouds and a flow-based policy with adaptive guidance to generate real-time motor commands.
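As a rough illustration of how a flow-based policy can combine with Classifier-Free Guidance, the sketch below Euler-integrates a guided velocity field from noise to an action. It is a minimal sketch under stated assumptions, not the authors' implementation: `velocity_fn`, `toy_velocity`, and the linear `guidance_weight` schedule are hypothetical stand-ins, and the summary does not say how the paper's adaptive guidance is actually computed.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by the guidance weight w."""
    return v_uncond + w * (v_cond - v_uncond)

def guidance_weight(t, w_start=3.0, w_end=1.0):
    """Hypothetical schedule (an assumption, not from the paper):
    linearly anneal the guidance weight as t goes from 0 to 1."""
    return w_start + (w_end - w_start) * t

def sample_action(velocity_fn, cond, action_dim, steps=10, seed=0):
    """Euler-integrate the guided flow from Gaussian noise (t=0) to an action (t=1)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(action_dim)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v_c = velocity_fn(a, t, cond)   # conditioned on 3D imagination/perception
        v_u = velocity_fn(a, t, None)   # condition dropped
        a = a + dt * guided_velocity(v_c, v_u, guidance_weight(t))
    return a

def toy_velocity(a, t, cond):
    """Toy stand-in for a learned network: points toward cond (or the origin)."""
    target = np.zeros_like(a) if cond is None else cond
    return target - a

action = sample_action(toy_velocity, cond=np.ones(4), action_dim=4)
```

With `w = 1` the guided velocity reduces to the conditional prediction, and larger weights push the sample harder toward the condition at some cost to sample quality, which is the trade-off the adaptive strategy is meant to balance.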
Key results
- 92% voxel IoU in future 3D state prediction
- Up to 8% higher success rates than state-of-the-art baselines on Adroit and Meta-World benchmarks
- Successful execution of long-horizon and contact-rich tasks in real-world evaluations
- Plug-and-play triplane world model with adaptive Classifier-Free Guidance for efficient action generation
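For readers unfamiliar with the voxel IoU metric cited in the results above, the following sketch shows one common way to compute it between a predicted and a ground-truth point cloud. The voxel size, grid origin, and grid shape are illustrative assumptions; the summary does not state the settings used in the paper.

```python
import numpy as np

def voxelize(points, voxel_size, origin, grid_shape):
    """Return a boolean occupancy grid for an (N, 3) point cloud.
    Points falling outside the grid are ignored."""
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid

def voxel_iou(pred_points, true_points, voxel_size=0.05,
              origin=(0.0, 0.0, 0.0), grid_shape=(32, 32, 32)):
    """Intersection over union of the occupied voxels of two point clouds."""
    origin = np.asarray(origin, dtype=float)
    p = voxelize(np.asarray(pred_points, dtype=float), voxel_size, origin, grid_shape)
    g = voxelize(np.asarray(true_points, dtype=float), voxel_size, origin, grid_shape)
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0  # both grids empty: treat as a perfect match
    return float(np.logical_and(p, g).sum() / union)

pred = [[0.01, 0.01, 0.01], [0.06, 0.01, 0.01]]
true = [[0.01, 0.01, 0.01]]
score = voxel_iou(pred, true)  # one shared voxel out of two occupied
```

Because the score depends on the voxel size, IoU values are only comparable between methods evaluated at the same grid resolution.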
Why it matters
It establishes a scalable, spatially-aware paradigm for robotic manipulation that bridges high-level world reasoning with low-level real-time control, advancing embodied AI and automation.
Abstract
Recent advancements in robotics have focused on developing foundation models capable of generating both actions and future states. Typically, these policies leverage world models to depict human-like imagination. However, most methods remain confined to the 2D domain, where they forecast only the final outcome state rather than the evolving interaction process, thereby offering limited guidance for step-by-step control. To address these limitations, we propose a hierarchical framework that couples 3D imagination, 3D perception, and action generation. A triplane-based world model captures future scene dynamics in a computationally efficient manner, providing predictive cues for decision-making. Based on these representations, the action expert, implemented with a flow-based policy network, converts the outputs of 3D imagination and perception into executable commands. We further introduce an adaptive Classifier-Free Guidance strategy to balance action quality with condition adherence. On Adroit, Meta-World, and real-world tasks, our method achieves a 92% voxel IoU in future state prediction and up to 8% higher success rates than state-of-the-art baselines. The performance gains highlight the effectiveness and generalizability of our method in complex robotic manipulation.
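To give a sense of why a triplane representation is computationally cheap, the sketch below projects a point cloud onto three axis-aligned planes (XY, XZ, YZ). This is only the geometric idea under simplifying assumptions: the paper's world model is a learned triplane autoencoder, whereas this example stores raw point counts, and the function name, resolution, and coordinate bounds are hypothetical.

```python
import numpy as np

def triplane_occupancy(points, resolution=64, lo=-1.0, hi=1.0):
    """Project an (N, 3) point cloud onto XY, XZ and YZ grids of point counts.
    Coordinates outside [lo, hi] are clamped to the border cells."""
    pts = np.asarray(points, dtype=float)
    # Map coordinates in [lo, hi] to integer cell indices in [0, resolution - 1].
    idx = np.floor((pts - lo) / (hi - lo) * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    planes = {}
    for name, (a, b) in {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}.items():
        plane = np.zeros((resolution, resolution), dtype=np.int64)
        # np.add.at accumulates correctly when several points share a cell.
        np.add.at(plane, (idx[:, a], idx[:, b]), 1)
        planes[name] = plane
    return planes

planes = triplane_occupancy([[0.0, 0.0, 0.0], [0.5, 0.5, -0.5]], resolution=4)
triplane_cells = 3 * 4 ** 2   # cells stored by three planes
voxel_cells = 4 ** 3          # cells stored by a dense voxel grid
```

Three planes at resolution R store 3R² cells, compared with R³ for a dense voxel grid, so the saving grows quickly as resolution increases.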