Ego-Vision World Model for Humanoid Contact Planning
Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, Koushil Sreenath
AI summary
Problem
Humanoid robots struggle to exploit physical contact in unstructured environments: traditional optimization-based planners break down under contact complexity, while on-policy reinforcement learning is sample-inefficient and generalizes poorly across tasks.
Approach
The authors train a scalable visual world model on a demonstration-free offline dataset to predict future outcomes in a compressed latent space, then use it inside a sampling-based MPC framework, guided by a learned surrogate value function, for robust, real-time planning.
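To make the planning loop concrete, here is a minimal sketch of value-guided sampling MPC in a latent space. It is an illustration under stated assumptions, not the authors' implementation. The function `plan_action`, the toy additive dynamics, the quadratic value function, and every hyperparameter are hypothetical stand-ins for the learned world model and surrogate value function described above.

```python
import numpy as np

def plan_action(z0, dynamics, value_fn, horizon=5, num_samples=256,
                action_dim=2, noise_std=0.5, rng=None):
    """Pick the first action of the best-scoring sampled action sequence.

    dynamics(z, a) -> next latent, batched over the first axis (assumed interface).
    value_fn(z)    -> one score per latent, higher is better (assumed interface).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences: (num_samples, horizon, action_dim).
    actions = rng.normal(0.0, noise_std, size=(num_samples, horizon, action_dim))
    # Keep a "do nothing" candidate so the plan is never scored worse than
    # staying still under the model.
    actions[0] = 0.0
    # Roll every candidate forward in latent space in parallel.
    z = np.repeat(z0[None, :], num_samples, axis=0)
    scores = np.zeros(num_samples)
    for t in range(horizon):
        z = dynamics(z, actions[:, t])
        scores += value_fn(z)  # dense value signal at every predicted step
    best = int(np.argmax(scores))
    return actions[best, 0], scores[best]

# Toy stand-ins (assumptions): additive latent dynamics and a value that
# rewards latents close to a goal point.
goal = np.array([1.0, 1.0])
toy_dynamics = lambda z, a: z + a
toy_value = lambda z: -np.sum((z - goal) ** 2, axis=1)

action, score = plan_action(np.zeros(2), toy_dynamics, toy_value,
                            rng=np.random.default_rng(0))
print(action, score)
```

In receding-horizon use, only the returned first action would be executed before replanning from the next observation. Scoring every predicted latent with a value estimate, rather than waiting for a sparse contact reward, is what gives the sampler a dense signal in this sketch.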
Key results
- Trained a scalable visual world model entirely from a demonstration-free offline dataset
- Introduced a value-guided sampling MPC framework for efficient action evaluation
- Demonstrated robust real-world contact planning (wall support, object blocking, arch traversal) on a physical humanoid using only ego-centric depth and proprioception
- Achieved improved sample efficiency and multi-task generalization compared to on-policy RL baselines
Why it matters
Provides a scalable, vision-based pathway for humanoids to safely exploit physical contact, accelerating their deployment in complex, unstructured real-world environments.
Abstract
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Code and dataset are available at our website: https://ego-vcp.github.io/.