ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation
Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita
AI summary
Problem
Collecting diverse, action-labeled demonstrations for bimanual manipulation is costly and limits policy scalability, while existing augmentation methods fail to generate novel poses with corresponding actions and physical consistency for third-person RGB-D setups.
Approach
The method fine-tunes Stable Diffusion with ControlNet, conditioning it on a rendered skeleton pose of the target robot configuration to generate geometrically consistent RGB-D views and corresponding joint-space action labels while enforcing contact constraints.
Key results
- Outperforms ACT and VISTA baselines across 5 simulated bimanual tasks
- Generates geometrically consistent RGB-D image pairs with valid action labels
- Achieves higher success rates than baselines across 300 real-world bimanual trials
- Enables offline data augmentation without interactive simulation rollouts
Why it matters
It provides a scalable, low-cost pipeline for training robust bimanual manipulation policies by overcoming the third-person vision data bottleneck.
Abstract
Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.
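As a rough illustration of the idea described above (perturb the robot pose, keep gripper-to-object contact physically consistent, and produce a matching action label), the following Python sketch may help. It is not the authors' implementation. It works on 3D gripper positions rather than joint space, it uses simple rejection sampling in place of the constrained optimization the abstract mentions, and it omits image synthesis entirely. The names `CONTACT_RADIUS`, `MAX_OFFSET`, `perturb_gripper`, and `relabel_action` are hypothetical and chosen only for this example.

```python
import math
import random

# Hypothetical values for illustration only; they are not taken from the paper.
CONTACT_RADIUS = 0.05  # metres: a gripper this close to the object counts as "in contact"
MAX_OFFSET = 0.03      # metres: per-axis bound on the random perturbation


def perturb_gripper(gripper, obj, rng):
    """Return a perturbed gripper position that respects a simple contact rule.

    Assumption: a gripper already in contact is left unchanged so the grasp
    geometry is preserved, and a free gripper must not be moved into contact.
    """
    if math.dist(gripper, obj) <= CONTACT_RADIUS:
        return tuple(gripper)
    for _ in range(100):  # rejection sampling stands in for constrained optimization
        candidate = tuple(g + rng.uniform(-MAX_OFFSET, MAX_OFFSET) for g in gripper)
        if math.dist(candidate, obj) > CONTACT_RADIUS:
            return candidate
    return tuple(gripper)  # fall back to the original pose if no valid sample is found


def relabel_action(perturbed, next_original):
    """Action label as the displacement from the perturbed pose to the next
    waypoint of the original demonstration (an assumed labeling scheme)."""
    return tuple(n - p for n, p in zip(next_original, perturbed))


if __name__ == "__main__":
    rng = random.Random(0)
    obj = (0.5, 0.0, 0.1)
    left, right = (0.5, 0.0, 0.12), (0.2, 0.3, 0.4)  # left is in contact, right is free
    new_left = perturb_gripper(left, obj, rng)
    new_right = perturb_gripper(right, obj, rng)
    print(new_left, relabel_action(new_left, (0.5, 0.0, 0.15)))
    print(new_right, relabel_action(new_right, (0.25, 0.3, 0.4)))
```

In this sketch the relabeled action steers the perturbed pose back toward the original trajectory, so each augmented state comes with an action that is consistent with the demonstration. This is an assumed labeling scheme; the abstract does not specify how ROPA computes its labels.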