DiffuDepGrasp: Diffusion-Based Depth Noise Modeling Empowers Sim-To-Real Robotic Grasping
Yingting Zhou, Wenbo Cui, Weiheng Liu, Guixing Chen, Haoran Li, Dongbin Zhao
AI summary
Problem
Transferring depth-based grasping policies from simulation to reality is hindered by sensor artifacts like voids and noise, while existing solutions suffer from data inefficiency or add computational latency during deployment.
Approach
DiffuDepGrasp trains a conditional diffusion model on minimal real RGB-D data to learn sensor noise patterns, then grafts these artifacts onto pristine simulation depth maps to train a student policy via imitation learning.
Key results
- 95.7% average success rate on 12-object grasping
- Zero-shot sim2real transfer without real-world training
- Adds no extra computational overhead at deployment
- Learns complex sensor noise from minimal unpaired RGB-D data
Why it matters
Enables robust, real-time robotic manipulation by bridging the sim2real perception gap efficiently without compromising control latency or requiring extensive paired datasets.
Abstract
Transferring a depth-based end-to-end policy trained in simulation to physical robots can yield an efficient and robust grasping policy, yet sensor artifacts in real depth maps, such as voids and noise, create a significant sim2real gap that critically impedes policy transfer. Training-time strategies such as procedural noise injection or learned mappings suffer from data inefficiency, owing either to unrealistic noise simulation, which is often ineffective for grasping tasks that require fine manipulation, or to a heavy dependence on paired datasets. Furthermore, leveraging foundation models to reduce the sim2real gap via intermediate representations fails to fully mitigate the domain shift and adds computational overhead during deployment. This work confronts the dual challenges of data inefficiency and deployment complexity. We propose DiffuDepGrasp, a deploy-efficient sim2real framework enabling zero-shot transfer through simulation-exclusive policy training. Its core innovation, the Diffusion Depth Generator, combines geometrically pristine simulation depth with learned sensor-realistic noise via two synergistic modules. The first, the Diffusion Depth Module, leverages temporal geometric priors to enable sample-efficient training of a conditional diffusion model that captures complex sensor noise distributions, while the second, the Noise Grafting Module, preserves metric accuracy during perceptual artifact injection. With only raw depth inputs during deployment, DiffuDepGrasp eliminates computational overhead and achieves a 95.7% average success rate on 12-object grasping with zero-shot transfer and strong generalization to unseen objects. Project website: https://diffudepgrasp.github.io/.
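
The abstract does not specify how the Noise Grafting Module works. As a rough illustration of the general idea only (injecting sensor-like artifacts while keeping the simulated geometry metrically accurate), the sketch below copies voids from a generated depth map onto a clean simulation depth map and adds a bounded residual elsewhere. The function name `graft_noise`, the `max_residual` bound, and the convention that a depth of 0 marks a void are all assumptions made for this example, not the authors' method.

```python
import numpy as np

def graft_noise(sim_depth, generated_depth, max_residual=0.02):
    """Hypothetical sketch: transfer sensor-like artifacts onto clean depth.

    sim_depth:       clean simulation depth map in metres (H x W).
    generated_depth: depth map with learned sensor noise, same shape;
                     pixels <= 0 are treated as voids (an assumption).
    max_residual:    bound in metres on the per-pixel deviation from the
                     simulation depth (an assumed, illustrative value).
    """
    sim = np.asarray(sim_depth, dtype=np.float32)
    gen = np.asarray(generated_depth, dtype=np.float32)
    if sim.shape != gen.shape:
        raise ValueError("depth maps must have the same shape")

    # Pixels where the generated map has no valid reading.
    void_mask = gen <= 0.0

    # Keep simulation geometry as the base; add only a clipped residual so
    # valid pixels stay within max_residual of the true simulated depth.
    residual = np.clip(gen - sim, -max_residual, max_residual)
    grafted = sim + residual

    # Copy the voids over as missing readings.
    grafted[void_mask] = 0.0
    return grafted, void_mask


if __name__ == "__main__":
    sim = np.full((2, 2), 1.0, dtype=np.float32)
    gen = np.array([[1.005, 0.0], [1.5, 0.99]], dtype=np.float32)
    grafted, voids = graft_noise(sim, gen)
    print(grafted)
    print(voids)
```

Clipping the residual is one simple way to keep valid pixels close to the simulated geometry; the actual module may preserve metric accuracy by a different mechanism.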