ICRA 2026

Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Shen Licheng, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao

AI summary

Repurposing pre-trained video diffusion models via lightweight LoRA adapters enables zero-shot, temporally consistent depth and normal estimation for transparent objects without real-world labels.

transparent depth estimation, video diffusion, LoRA fine-tuning, synthetic data, robotic perception, normal estimation

Problem

Transparent and reflective objects break standard depth-sensing assumptions, causing missing regions and temporal instability that hinder robotic manipulation. Existing data-driven methods rely on training data of limited diversity and generalize poorly to real-world transparent scenes.

Approach

The authors build TransPhy3D, a large synthetic video dataset of transparent and reflective scenes, and repurpose a large pre-trained video diffusion model into a video-to-video depth (and normal) estimator. Only lightweight LoRA adapters are trained, with co-training on TransPhy3D and existing frame-wise synthetic datasets.
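
As a rough illustration of what LoRA fine-tuning means in general (not the authors' code), the sketch below applies a frozen weight matrix plus a trainable low-rank update, y = xWᵀ + (alpha/r)·xAᵀBᵀ. The layer sizes, the rank r = 4, the scale alpha = 16, and the function name `lora_linear` are all illustrative assumptions; this page does not state which layers are adapted or with what rank.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16.0):
    """Frozen linear layer plus a low-rank LoRA update.

    x: (batch, d_in) input
    W: (d_out, d_in) frozen pre-trained weight
    A: (r, d_in) trainable down-projection
    B: (d_out, r) trainable up-projection
    alpha: scaling hyperparameter (value here is an assumption)
    """
    r = A.shape[0]
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 48, 4               # illustrative sizes only
W = rng.standard_normal((d_out, d_in))   # stays frozen during fine-tuning
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # zero init: the adapter starts as a no-op
x = rng.standard_normal((2, d_in))

base = x @ W.T
adapted = lora_linear(x, W, A, B)
full_params = W.size                     # parameters in the frozen layer
lora_params = A.size + B.size            # parameters actually trained
```

With B initialized to zero, the adapted layer reproduces the pre-trained layer exactly at the start of training, and only A and B (far fewer values than W) need gradients.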

Key results

Introduced TransPhy3D, an 11k-video synthetic dataset of transparent/reflective scenes
Achieved zero-shot state-of-the-art depth accuracy and temporal consistency on ClearPose, DREDS, and TransPhy3D-Test
Set new video normal estimation SOTA with the DKT-Normal variant
Boosted robotic grasping success rates across translucent, reflective, and diffuse surfaces

Why it matters

Enables robust, label-free 3D perception for robotics and computer vision, overcoming long-standing physical ambiguities of transparent materials in dynamic environments.

Abstract

Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences (1.32M frames) rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines (e.g., Depth-Anything-v2, DepthCrafter), and a normal variant (DKT-Normal) sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at 0.17 s/frame (832×480). Integrated into a grasping stack, DKT’s depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: “Diffusion knows transparency.” Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
Code and models are available at https://daniellli.github.io/projects/DKT/.
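
The abstract's training step, concatenating RGB latents with noisy depth latents before the DiT backbone, might look roughly like the following. This is a minimal sketch under stated assumptions, not the released implementation: the latent layout (batch, channels, frames, height, width), concatenation along the channel axis, the linear noising rule, and the helper name `build_denoiser_input` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def build_denoiser_input(rgb_latent, depth_latent, t, rng):
    """Noise the depth latent and concatenate it with the clean RGB latent.

    rgb_latent, depth_latent: arrays of shape (B, C, T, H, W) (assumed layout)
    t: noise level in [0, 1]; 0 keeps the clean depth latent, 1 is pure noise
    Returns the concatenated input of shape (B, 2C, T, H, W) and the noise used.
    """
    noise = rng.standard_normal(depth_latent.shape)
    # Linear interpolation between data and noise: an assumed schedule.
    noisy_depth = (1.0 - t) * depth_latent + t * noise
    # Channel-wise concatenation: the RGB latent conditions the denoiser (assumed axis).
    model_input = np.concatenate([rgb_latent, noisy_depth], axis=1)
    return model_input, noise

rng = np.random.default_rng(0)
rgb = rng.standard_normal((1, 16, 5, 8, 8))    # illustrative latent sizes
depth = rng.standard_normal((1, 16, 5, 8, 8))
model_input, noise = build_denoiser_input(rgb, depth, t=0.5, rng=rng)
```

In this sketch the RGB half of the input is never noised, so the network always sees the clean video as conditioning while it learns to recover the depth latent.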

Index terms

Perception for Grasping and Manipulation; Deep Learning in Grasping and Manipulation; Data Sets for Robotic Vision