ICRA 2026

DiffuDepGrasp: Diffusion-Based Depth Noise Modeling Empowers Sim-To-Real Robotic Grasping

Yingting Zhou, Wenbo Cui, Weiheng Liu, Guixing Chen, Haoran Li, Dongbin Zhao

AI summary

A diffusion-based depth noise generator enables zero-shot sim2real transfer for robotic grasping, achieving 95.7% success without real-world training or deployment overhead.

Sim2Real, Robotic Grasping, Diffusion Models, Depth Noise, Imitation Learning, Zero-Shot Transfer

Problem

Transferring depth-based grasping policies from simulation to reality is hindered by sensor artifacts like voids and noise, while existing solutions suffer from data inefficiency or add computational latency during deployment.

Approach

DiffuDepGrasp trains a conditional diffusion model on minimal real RGB-D data to learn sensor noise patterns, then grafts these artifacts onto pristine simulation depth maps to train a student policy via imitation learning.
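The grafting step can be illustrated with a minimal sketch. This is one plausible reading of "grafting artifacts onto pristine simulation depth", not the authors' implementation: the function name `graft_noise`, the convention that a void is encoded as 0.0, and the `max_residual` clipping bound are all assumptions made for illustration. The idea shown is that voids are copied from a generated noisy depth map, while valid pixels stay close to the metrically accurate simulation value.

```python
def graft_noise(sim_depth, generated_depth, void_value=0.0, max_residual=0.01):
    """Graft sensor-like artifacts onto a clean simulated depth map.

    Hypothetical sketch: depth maps are nested lists of metres, and a void
    (missing reading) is assumed to be encoded as `void_value`.
    """
    grafted = []
    for sim_row, gen_row in zip(sim_depth, generated_depth):
        out_row = []
        for sim_px, gen_px in zip(sim_row, gen_row):
            if gen_px == void_value:
                # Copy the void so the policy sees realistic holes.
                out_row.append(void_value)
            else:
                # Keep the simulated metric depth and add only a bounded
                # residual, so the geometry stays close to ground truth.
                residual = gen_px - sim_px
                residual = max(-max_residual, min(max_residual, residual))
                out_row.append(sim_px + residual)
        grafted.append(out_row)
    return grafted


sim = [[1.0, 1.0], [2.0, 2.0]]
generated = [[0.0, 1.005], [2.5, 1.99]]
grafted = graft_noise(sim, generated)
```

Clipping the residual is a design choice of this sketch: it lets small sensor-like perturbations through while preventing a generated map from shifting the scene geometry by more than `max_residual` metres.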

Key results

95.7% average success rate on 12-object grasping
Zero-shot sim2real transfer without real-world training
Eliminates deployment computational overhead
Learns complex sensor noise from minimal unpaired RGB-D data

Why it matters

Enables robust, real-time robotic manipulation by bridging the sim2real perception gap efficiently without compromising control latency or requiring extensive paired datasets.

Abstract

Transferring a depth-based end-to-end policy trained in simulation to physical robots can yield an efficient and robust grasping policy, yet sensor artifacts in real depth maps, such as voids and noise, create a significant sim2real gap that critically impedes policy transfer. Training-time strategies such as procedural noise injection or learned mappings suffer from data inefficiency: the former simulates noise unrealistically, which is often ineffective for grasping tasks that require fine manipulation, while the latter depends heavily on paired datasets. Furthermore, leveraging foundation models to reduce the sim2real gap via intermediate representations fails to fully mitigate the domain shift and adds computational overhead during deployment. This work confronts the dual challenges of data inefficiency and deployment complexity. We propose DiffuDepGrasp, a deploy-efficient sim2real framework enabling zero-shot transfer through simulation-exclusive policy training. Its core innovation, the Diffusion Depth Generator, synthesizes geometrically pristine simulation depth with learned sensor-realistic noise via two synergistic modules. The first, the Diffusion Depth Module, leverages temporal geometric priors to enable sample-efficient training of a conditional diffusion model that captures complex sensor noise distributions, while the second, the Noise Grafting Module, preserves metric accuracy during perceptual artifact injection. With only raw depth inputs during deployment, DiffuDepGrasp eliminates computational overhead and achieves a 95.7% average success rate on 12-object grasping with zero-shot transfer and strong generalization to unseen objects. Project website: https://diffudepgrasp.github.io/.
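For readers unfamiliar with diffusion models, the sketch below shows the standard DDPM-style forward (noising) step that such models are typically trained to invert. The abstract does not state which diffusion formulation, noise schedule, or conditioning scheme the Diffusion Depth Module uses, so the linear beta schedule, the function names, and the assumption that depth values are normalized to a small range are all illustrative choices, not details from the paper.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(depth_values, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.

    `depth_values` is a flat list of (assumed normalized) depth pixels and
    `eps` is standard Gaussian noise drawn from `rng`.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * d + noise_scale * rng.gauss(0.0, 1.0)
            for d in depth_values]


alpha_bars = linear_alpha_bars(1000)
rng = random.Random(0)
noisy_pixels = forward_noise([0.5, 0.7], alpha_bars[10], rng)
```

A conditional model would additionally receive side information at each denoising step; what that conditioning signal is in DiffuDepGrasp is not specified in the text above.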

Index terms

Reinforcement Learning, Deep Learning in Grasping and Manipulation, Grasping