Best of Sim and Real: Decoupled Visuomotor Manipulation Via Learning Control in Simulation and Perception in Real
Jialei Huang, Zhao-Heng Yin, Yingdong Hu, Shuo Wang, Xingyu Lin, Yang Gao
AI summary
Problem
Sim-to-real transfer in robot manipulation is hindered by the entanglement of perception and control in end-to-end learning, which forces networks to simultaneously handle visual domain shifts and physical dynamics gaps, requiring extensive real-world data.
Approach
The method trains a control policy using perfect state information in simulation, then freezes it and learns a lightweight visual bridge in the real world to map camera images to the policy's expected input space.
Key results
- Achieves 73–88% success with only 10–20 real demonstrations per task
- Outperforms end-to-end baselines by 30–50 percentage points in data efficiency
- Maintains graceful performance degradation when generalizing to out-of-distribution object positions and scales
- Visual bridge and pretrained vision encoder are critical for few-shot real-world adaptation
Why it matters
Enables practical, data-efficient deployment of robot manipulation policies in new environments without relying on costly, large-scale real-world training data.
Abstract
Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10-20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.
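The decoupling described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the linear policy, the ridge-regression bridge, the synthetic "encoder features", the dimensions, and the assumption that every real frame comes with a target in the policy's input space are all hypothetical stand-ins chosen to keep the example self-contained. It only shows the structure: the control policy is fixed after simulation training, and only a small perception map is fit on a handful of real demonstrations.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, FEAT_DIM = 4, 2, 16  # hypothetical sizes

# Stand-in for a control policy trained in simulation on privileged state.
# Here it is a fixed linear map that is never updated again (frozen).
W_policy = rng.normal(size=(ACTION_DIM, STATE_DIM))

def frozen_policy(state):
    """Map a batch of states (N, STATE_DIM) to actions (N, ACTION_DIM)."""
    return state @ W_policy.T

# Stand-in for real observations: features of a pretrained vision encoder.
# Assumption: features are an unknown linear function of state plus noise.
render = rng.normal(size=(STATE_DIM, FEAT_DIM))

def observe(state):
    noise = 0.01 * rng.normal(size=(state.shape[0], FEAT_DIM))
    return state @ render + noise

def fit_bridge(features, targets, ridge=1e-3):
    """Fit a linear 'visual bridge' (ridge regression) from features to
    the input space that the frozen policy expects."""
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)

# A small real-world dataset: 20 demonstrations of 10 frames each.
demo_states = rng.uniform(-1.0, 1.0, size=(20 * 10, STATE_DIM))
bridge = fit_bridge(observe(demo_states), demo_states)

def act_from_features(features):
    """Deployment path: features -> bridge -> frozen policy -> action."""
    return frozen_policy(features @ bridge)

# Compare actions computed from features with actions from true state.
test_states = rng.uniform(-1.0, 1.0, size=(100, STATE_DIM))
action_error = np.abs(
    act_from_features(observe(test_states)) - frozen_policy(test_states)
).max()
print(f"max action error on held-out states: {action_error:.4f}")
```

In this sketch only `bridge` is learned from real data, so the real-world learning problem is a supervised regression with `FEAT_DIM * STATE_DIM` parameters rather than a full visuomotor policy. A real system would replace the linear maps with neural networks and the synthetic features with encoder outputs on camera images.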