ICRA 2026

ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation

Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita

AI summary

ROPA scales bimanual imitation learning by using a pose-guided diffusion model to synthesize consistent third-person RGB-D images and valid action labels from limited demonstrations.

bimanual manipulation, data augmentation, diffusion models, imitation learning, RGB-D synthesis, robot pose generation

Problem

Collecting diverse, action-labeled demonstrations for bimanual manipulation is costly and limits policy scalability, while existing augmentation methods fail to generate novel poses with corresponding actions and physical consistency for third-person RGB-D setups.

Approach

The method fine-tunes Stable Diffusion with ControlNet, conditioning it on a rendered skeleton pose of the target robot configuration to generate geometrically consistent RGB-D views and corresponding joint-space action labels while enforcing contact constraints.
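As a rough illustration of what a "rendered skeleton pose" conditioning image could look like, the sketch below projects 3D joint positions into a third-person camera with a pinhole model and rasterizes the links into a binary mask. This is not the authors' renderer: the function names (`project_point`, `draw_line`, `render_skeleton`), the camera intrinsics, the joint coordinates, and the link list are all hypothetical values chosen for the example.

```python
def project_point(p, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v)."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def draw_line(img, p0, p1):
    """Rasterize a segment by sampling points along it (simple, not Bresenham)."""
    height, width = len(img), len(img[0])
    (u0, v0), (u1, v1) = p0, p1
    steps = int(max(abs(u1 - u0), abs(v1 - v0))) + 1
    for i in range(steps + 1):
        t = i / steps
        u = round(u0 + t * (u1 - u0))
        v = round(v0 + t * (v1 - v0))
        if 0 <= v < height and 0 <= u < width:
            img[v][u] = 1


def render_skeleton(joints_cam, links, width, height, fx, fy, cx, cy):
    """Return a height x width binary mask with one line per robot link."""
    img = [[0] * width for _ in range(height)]
    pixels = [project_point(p, fx, fy, cx, cy) for p in joints_cam]
    for a, b in links:
        draw_line(img, pixels[a], pixels[b])
    return img


# Hypothetical bimanual pose: three joints per arm, in camera coordinates (meters).
joints = [
    (-0.3, 0.2, 1.0), (-0.2, 0.0, 1.0), (-0.05, -0.1, 1.0),  # left arm
    (0.3, 0.2, 1.0), (0.2, 0.0, 1.0), (0.05, -0.1, 1.0),     # right arm
]
links = [(0, 1), (1, 2), (3, 4), (4, 5)]

# Assumed intrinsics for a 64 x 64 image; a real setup would use calibrated values.
mask = render_skeleton(joints, links, width=64, height=64,
                       fx=60.0, fy=60.0, cx=32.0, cy=32.0)
```

In a pipeline of this kind, a mask like `mask` would serve as the spatial conditioning input to the image generator, so the synthesized robot appears at the requested pose.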

Key results

Outperforms ACT and VISTA baselines across 5 simulated bimanual tasks
Generates geometrically consistent RGB-D image pairs with valid action labels
Achieves higher success rates in 300 real-world bimanual trials
Enables offline data augmentation without interactive simulation rollouts

Why it matters

It provides a scalable, low-cost pipeline for training robust bimanual manipulation policies by overcoming the third-person vision data bottleneck.

Abstract

Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.
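To give a feel for what "constrained optimization" with a contact constraint can mean, here is a deliberately tiny stand-in, not the formulation used in the paper. It takes a randomly perturbed gripper position and finds the closest position that stays within a maximum distance of the object, which is a Euclidean projection onto a ball and has a closed-form solution. The function name `project_to_contact_ball`, the single distance constraint, and the numeric values are assumptions made for this example; the paper's method works with joint-space labels and bimanual constraints that this sketch does not model.

```python
import math


def project_to_contact_ball(p_sampled, obj_center, max_dist):
    """Solve min ||p - p_sampled|| subject to ||p - obj_center|| <= max_dist.

    This is the Euclidean projection of p_sampled onto a ball around the
    object (a toy contact constraint, assumed for illustration).
    """
    offset = [a - b for a, b in zip(p_sampled, obj_center)]
    dist = math.sqrt(sum(d * d for d in offset))
    if dist <= max_dist:
        # Constraint already satisfied: the sample is its own projection.
        return tuple(p_sampled)
    # Otherwise pull the sample back onto the ball's surface along the offset.
    scale = max_dist / dist
    return tuple(c + scale * d for c, d in zip(obj_center, offset))


# A perturbation that already satisfies the constraint is left unchanged.
inside = project_to_contact_ball((0.1, 0.0, 0.0), (0.0, 0.0, 0.0), 0.5)

# A perturbation that drifts too far from the object is projected back.
outside = project_to_contact_ball((3.0, 0.0, 4.0), (0.0, 0.0, 0.0), 0.5)
```

The design point this illustrates is that augmented poses are not accepted as sampled: each candidate is adjusted to the nearest configuration that keeps the gripper-object relationship physically plausible.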

Index terms

Data Sets for Robot Learning, Imitation Learning, Bimanual Manipulation