ICRA 2026

Best of Sim and Real: Decoupled Visuomotor Manipulation Via Learning Control in Simulation and Perception in Real

Jialei Huang, Zhao-Heng Yin, Yingdong Hu, Shuo Wang, Xingyu Lin, Yang Gao

AI summary

Decoupling control and perception enables robust sim-to-real robot manipulation with only 10–20 real demonstrations, drastically improving data efficiency and out-of-distribution generalization.

Sim-to-real transfer; decoupled learning; visuomotor control; privileged state; visual bridge; data efficiency

Problem

Sim-to-real transfer in robot manipulation is hindered by the entanglement of perception and control in end-to-end learning, which forces networks to simultaneously handle visual domain shifts and physical dynamics gaps, requiring extensive real-world data.

Approach

The method trains a control policy using perfect state information in simulation, then freezes it and learns a lightweight visual bridge in the real world to map camera images to the policy's expected input space.
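The two-stage idea above can be illustrated with a deliberately tiny sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: the "policy" is a fixed proportional controller, the "camera" is an affine pixel mapping, and the "visual bridge" is a per-axis least-squares fit supervised with known object positions. How the authors actually supervise and parameterize the bridge is not stated in this summary.

```python
# Toy illustration of decoupling control from perception (all names hypothetical).

# Stage 1 stand-in: a controller "learned in simulation" from privileged state
# (object and gripper positions in metres). It is never modified afterwards.
def frozen_policy(object_xy, gripper_xy, gain=0.5):
    """Return a 2-D velocity command that moves the gripper toward the object."""
    return tuple(gain * (o - g) for o, g in zip(object_xy, gripper_xy))

# Assumed camera model: each pixel coordinate is an affine function of position.
# The bridge does not get to see these constants; it only sees samples.
def camera(object_xy):
    return (640.0 * object_xy[0] + 320.0, -480.0 * object_xy[1] + 240.0)

def fit_axis(pixels, metres):
    """Closed-form 1-D least squares: metres ~ slope * pixels + intercept."""
    n = len(pixels)
    mean_p = sum(pixels) / n
    mean_m = sum(metres) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pixels, metres))
    var = sum((p - mean_p) ** 2 for p in pixels)
    slope = cov / var
    return slope, mean_m - slope * mean_p

# Stage 2: fit the bridge from ten "real" samples with known positions.
demo_positions = [(0.05 * i - 0.2, 0.03 * i - 0.1) for i in range(10)]
demo_pixels = [camera(p) for p in demo_positions]
bridge = [
    fit_axis([px[k] for px in demo_pixels], [p[k] for p in demo_positions])
    for k in range(2)
]

def visual_bridge(pixel_xy):
    """Map a pixel observation into the state space the frozen policy expects."""
    return tuple(s * px + b for (s, b), px in zip(bridge, pixel_xy))

def deployed_policy(pixel_xy, gripper_xy):
    """Perception (learned from real samples) feeding the unchanged controller."""
    return frozen_policy(visual_bridge(pixel_xy), gripper_xy)

# Query with an object position outside the range covered by the samples.
ood_object = (0.6, -0.4)
gripper = (0.0, 0.0)
action_privileged = frozen_policy(ood_object, gripper)
action_deployed = deployed_policy(camera(ood_object), gripper)
action_gap = max(abs(a - b) for a, b in zip(action_privileged, action_deployed))
print(action_privileged, action_deployed, action_gap)
```

In this noise-free toy the bridge recovers the state almost exactly, so the deployed policy reproduces the privileged-state actions even for an unseen position. Real images, sensor noise, and a learned encoder make the alignment problem far harder; the sketch only shows where the boundary between the frozen controller and the adapted perception module sits.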

Key results

Achieves 73–88% success with only 10–20 real demonstrations per task
Outperforms end-to-end baselines by 30–50 percentage points of success rate when real data is limited
Maintains graceful performance degradation when generalizing to out-of-distribution object positions and scales
Visual bridge and pretrained vision encoder are critical for few-shot real-world adaptation

Why it matters

Enables practical, data-efficient deployment of robot manipulation policies in new environments without relying on costly, large-scale real-world training data.

Abstract

Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10–20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.

Index terms

Dexterous Manipulation; Simulation and Animation