ICRA 2026

Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-To-Real Manipulation

Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager

AI summary

Combining vision-language model priors with online interaction data through uncertainty-aware fusion improves sim-to-real robotic manipulation performance over a domain randomization baseline.
Sim-to-real transfer; Robotic manipulation; Vision-language models; Uncertainty-aware adaptation; Reinforcement learning; Digital twins

Problem

Deploying simulation-trained robotic policies in the real world remains challenging because of the sim-to-real gap and because object dynamics vary from object to object. Domain randomization handles this variation poorly, since it defaults to averaged behaviors.

Approach

Phys2Real trains reinforcement learning policies conditioned on interpretable physical parameters and fuses vision-language model estimates with online interaction data using inverse-variance weighting based on quantified uncertainty.
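
The inverse-variance weighting mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats each physical parameter as a scalar with a mean and a variance, and the names `Estimate`, `ensemble_estimate`, and `fuse_inverse_variance`, along with all numeric values, are hypothetical. The ensemble helper assumes the uncertainty of the online estimate is taken as the spread of several model predictions.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass
class Estimate:
    """A scalar physical-parameter estimate with its uncertainty."""
    mean: float
    variance: float


def ensemble_estimate(predictions: list[float]) -> Estimate:
    """Summarize ensemble predictions by their mean and population variance.

    Assumption: disagreement between ensemble members stands in for
    the uncertainty of the online (interaction-based) estimate.
    """
    return Estimate(fmean(predictions), pvariance(predictions))


def fuse_inverse_variance(prior: Estimate, online: Estimate) -> Estimate:
    """Fuse two estimates, weighting each by the inverse of its variance."""
    if prior.variance <= 0 or online.variance <= 0:
        raise ValueError("variances must be positive")
    w_prior = 1.0 / prior.variance
    w_online = 1.0 / online.variance
    fused_variance = 1.0 / (w_prior + w_online)
    fused_mean = fused_variance * (w_prior * prior.mean + w_online * online.mean)
    return Estimate(fused_mean, fused_variance)


# Hypothetical values: a broad VLM prior and a tighter online estimate.
vlm_prior = Estimate(mean=0.0, variance=4.0)
online = ensemble_estimate([0.9, 1.0, 1.1])
fused = fuse_inverse_variance(vlm_prior, online)
print(f"fused mean={fused.mean:.4f}, variance={fused.variance:.6f}")
```

With this rule the fused variance is never larger than either input variance, and the fused mean is pulled toward whichever source is more confident.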

Key results

  • 100% success rate on the bottom-weighted T-block, versus 79% for the domain randomization baseline
  • 57% success rate on the more challenging top-weighted T-block, versus 23% for the baseline
  • 15% faster average task completion for hammer pushing, relative to the baseline
  • Ablation studies indicate that combining VLM and interaction information is essential for success

Why it matters

Provides a scalable pathway for robots to adapt to novel object dynamics in real-world manipulation without costly real-world training.

Abstract

Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success.

Project website: https://phys2real.github.io/

This work is in part supported by ONR N00014-23-1-2355, ONR MURI N00014-22-1-2740, ONR MURI N00014-24-1-2748, NSF RI #2338203, and NSF FRR grant 2342246. M. Wang is supported by the NASA NSTGRO Fellowship and NSF grant 2342246. S. Tian and A. Swann are supported by NSF GRFP Grant No. DGE-1656518 and DGE-2146755, respectively.

1 Stanford University, Stanford, CA, USA. 2 Princeton University, Princeton, NJ, USA.

Index terms

Reinforcement Learning; Machine Learning for Robot Control; Deep Learning in Grasping and Manipulation
