ICRA 2026

Ego-Vision World Model for Humanoid Contact Planning

Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, Koushil Sreenath

AI summary

A demonstration-free visual world model combined with value-guided MPC enables robust, multi-task contact planning for humanoids with superior sample efficiency over on-policy RL.

humanoid robotics, contact planning, world models, model predictive control, reinforcement learning, vision-based control

Problem

Humanoid robots struggle to exploit physical contact in unstructured environments: traditional planners break down under contact complexity, while on-policy reinforcement learning is sample-inefficient and generalizes poorly across tasks.

Approach

The authors train a scalable visual world model on a demonstration-free offline dataset to predict future outcomes in a compressed latent space, then use it inside a sampling-based MPC framework guided by a learned surrogate value function for robust, real-time planning.
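As a rough illustration of value-guided sampling MPC in a latent space, the sketch below samples candidate action sequences, rolls each one forward through a dynamics function, scores it with a value function, and returns the first action of the best sequence. It is not the authors' implementation. The names `plan_action`, `toy_dynamics`, and `toy_value`, the Gaussian sampling scheme, the discounted sum of values used as the score, and all parameter values are assumptions made for illustration; in the paper the dynamics and value function are learned networks.

```python
import numpy as np

GOAL = np.array([1.0, 1.0])  # hypothetical target latent state


def toy_dynamics(z, a):
    """Stand-in for a learned latent dynamics model (assumption): z' = z + 0.1 * a."""
    return z + 0.1 * a


def toy_value(z):
    """Stand-in for a learned surrogate value (assumption): closer to GOAL is better."""
    return -np.linalg.norm(z - GOAL, axis=1)


def plan_action(z0, dynamics, value_fn, horizon=5, num_samples=256,
                action_dim=2, gamma=0.99, noise_std=0.5, rng=None):
    """Return (first action, score) of the best-scoring sampled action sequence."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences, shape (num_samples, horizon, action_dim).
    actions = rng.normal(0.0, noise_std, size=(num_samples, horizon, action_dim))
    # Every candidate starts from the same current latent state.
    z = np.repeat(z0[None, :], num_samples, axis=0)
    scores = np.zeros(num_samples)
    for t in range(horizon):
        # Roll all candidates forward one step in latent space.
        z = dynamics(z, actions[:, t])
        # Accumulate the discounted value of each predicted latent state.
        scores += (gamma ** t) * value_fn(z)
    best = int(np.argmax(scores))
    # Receding horizon: only the first action of the best sequence is executed.
    return actions[best, 0], scores[best]
```

Calling `plan_action(np.zeros(2), toy_dynamics, toy_value)` at every control step and replanning from the newly encoded latent state gives the usual receding-horizon loop. Scoring with a dense value at each predicted state, rather than a sparse contact reward, is what lets short rollouts rank the candidates.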

Key results

Trained a scalable visual world model entirely from a demonstration-free offline dataset
Introduced a value-guided sampling MPC framework for efficient action evaluation
Demonstrated robust real-world contact planning (wall support, object blocking, arch traversal) on a physical humanoid using only ego-centric depth and proprioception
Achieved improved sample efficiency and multi-task generalization compared to on-policy RL baselines

Why it matters

Provides a scalable, vision-based pathway for humanoids to safely exploit physical contact, accelerating their deployment in complex, unstructured real-world environments.

Abstract

Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Code and dataset are available at our website: https://ego-vcp.github.io/.

Index terms

Multi-Contact Whole-Body Motion Planning and Control; Integrated Planning and Learning; Deep Learning for Visual Perception