ICRA 2026

Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets

Jeremiah Coholich, Justin Wit, Robert Azarcon, Zsolt Kira

AI summary

Key figure

MANGO translates diverse simulated camera views into realistic images from unseen real-world viewpoints, improving policy robustness to viewpoint shifts without requiring multi-view real-world data.

Sim2real translation; Robot manipulation; Viewpoint robustness; GAN-based augmentation; Unpaired image translation; Robot policy learning

Problem

Vision-based robot policies trained on fixed-camera datasets fail when the camera viewpoint shifts at deployment. Real-world demonstration data with diverse viewpoints is scarce and labor-intensive to collect.

Approach

MANGO uses a GAN-based unpaired image translation framework with a segmentation-conditioned InfoNCE loss, a modified PatchNCE loss, and a highly-regularized patch discriminator to preserve viewpoint consistency while translating simulated observations to realistic real-world styles.
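To make the contrastive term concrete, the following is a minimal sketch of a generic PatchNCE-style InfoNCE loss of the kind used in unpaired image translation. It is not MANGO's loss: the segmentation conditioning and the PatchNCE modifications are not reproduced here, and the function names, the toy 2-D "features", and the temperature of 0.07 are all illustrative assumptions. Each output patch is pulled toward the input patch at the same location and pushed away from input patches at other locations, which is what ties the translated image to the layout of the simulated one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE: cross-entropy of selecting the positive among all candidates.
    The temperature value is an assumption, not taken from the paper."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

def patch_nce(src_feats, out_feats, temperature=0.07):
    """Average InfoNCE over patch locations (hypothetical helper).
    src_feats[i] / out_feats[i]: features of patch i in the simulated input
    and in the translated output. Positive: same location; negatives: all others."""
    total = 0.0
    for i, query in enumerate(out_feats):
        negatives = [f for j, f in enumerate(src_feats) if j != i]
        total += info_nce(query, src_feats[i], negatives, temperature)
    return total / len(out_feats)

# Toy features: three patches with 2-D embeddings.
src = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = patch_nce(src, src)                          # output preserves layout
shuffled = patch_nce(src, [src[1], src[2], src[0]])    # output scrambles layout
print(aligned < shuffled)
```

In this toy setup, an output whose patches stay aligned with the input incurs a much lower loss than one whose patches are permuted, which is the property that makes such a loss a plausible tool for preserving viewpoint.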

Key results

Achieves state-of-the-art FID scores on randomized and wrist-view sim2real translation benchmarks.
Boosts shifted-view success rates by over 40 percentage points in certain real-world tabletop manipulation tasks, compared to policies trained without augmentation.
Operates approximately 2,700x faster than diffusion-based augmentation methods.
Finds that segmentation-guided contrastive learning and patch regularization are crucial for preserving viewpoint during translation.
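For readers unfamiliar with the metric in the first result, the standard definition of the Fréchet Inception Distance (FID) is given below. The notation is ours rather than the paper's: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception-network features computed over real and generated (here, translated) images, respectively. Lower values indicate that the two feature distributions are closer.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```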

Why it matters

Provides robot learning practitioners with a computationally efficient pipeline for training viewpoint-robust policies from simulation plus a small amount of fixed-camera real-world data, reducing the need for multi-view real-world data collection.

Abstract

Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO – an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image translation methods we tested. In certain real-world tabletop manipulation tasks, MANGO augmentation increases shifted-view success rates by over 40 percentage points compared to policies trained without augmentation. For more results, visit: https://www.jeremiahcoholich.com/mango.

Index terms

Deep Learning in Grasping and Manipulation; Imitation Learning; Deep Learning for Visual Perception