Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-To-Real Manipulation
Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager
AI summary
Problem
Deploying simulation-trained robotic policies to the real world remains challenging due to the sim-to-real gap and varying object dynamics that domain randomization handles poorly by defaulting to averaged behaviors.
Approach
Phys2Real trains reinforcement learning policies conditioned on interpretable physical parameters and fuses vision-language model estimates with online interaction data using inverse-variance weighting based on quantified uncertainty.
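The inverse-variance weighting mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single scalar physical parameter, treats both the VLM prior and the interaction-based estimate as Gaussians, and takes the spread of an ensemble's predictions as the online variance. The function names `ensemble_estimate` and `fuse_inverse_variance` and all numbers are hypothetical.

```python
from statistics import fmean, pvariance


def ensemble_estimate(predictions):
    """Return (mean, variance) across ensemble member predictions.

    Assumption: the disagreement between ensemble members is used as the
    uncertainty of the online estimate.
    """
    return fmean(predictions), pvariance(predictions)


def fuse_inverse_variance(mu_a, var_a, mu_b, var_b):
    """Fuse two Gaussian estimates of one parameter by inverse-variance weighting.

    Each estimate is weighted by 1 / variance, so the more certain estimate
    dominates. The fused variance is never larger than either input variance.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused_var = 1.0 / (w_a + w_b)
    fused_mu = fused_var * (w_a * mu_a + w_b * mu_b)
    return fused_mu, fused_var


# Hypothetical values: a broad VLM prior and a tighter online estimate
# computed from three made-up ensemble predictions.
vlm_mu, vlm_var = 0.0, 4.0
online_mu, online_var = ensemble_estimate([1.0, 2.0, 3.0])
fused_mu, fused_var = fuse_inverse_variance(vlm_mu, vlm_var, online_mu, online_var)
```

In this sketch the fused mean sits closer to whichever source reports the smaller variance, which is the behavior the summary describes: the VLM prior dominates early, and the interaction-based estimate takes over as its uncertainty shrinks.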
Key results
- 100% success rate on bottom-weighted T-block pushing versus 79% for the domain randomization baseline
- 57% success rate on the challenging top-weighted T-block versus 23% for the same baseline
- 15% faster average task completion for hammer pushing
- Ablation studies indicate that combining VLM and interaction information is essential for success
Why it matters
Provides a scalable pathway for robots to adapt to novel object dynamics in real-world manipulation without costly real-world training.
Abstract
Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/
This work is in part supported by ONR N00014-23-1-2355, ONR MURI N00014-22-1-2740, ONR MURI N00014-24-1-2748, NSF RI #2338203, and NSF FRR grant 2342246. M. Wang is supported by the NASA NSTGRO Fellowship and NSF grant 2342246. S. Tian and A. Swann are supported by NSF GRFP Grant No. DGE-1656518 and DGE-2146755, respectively.
1 Stanford University, Stanford, CA, USA. 2 Princeton University, Princeton, NJ, USA.