ICRA 2026

ClearDepth: Efficient Stereo Perception of Transparent Objects for Robotic Manipulation

Kaixin Bai, Huajian Zeng, Lei Zhang, Yiwen Liu, Hongli Xu, Zhaopeng Chen, Jianwei Zhang

AI summary

ClearDepth uses a vision transformer and structural feature fusion to recover stereo depth for transparent objects, raising real-world robotic grasp success rates by at least 18% over prior methods for transparent object manipulation.

Transparent object perception, Stereo depth estimation, Vision transformer, Sim2Real, Robotic grasping, Synthetic dataset

Problem

Standard stereo sensors and matching algorithms fail on transparent objects due to light refraction and reflection, producing unreliable depth maps that hinder robotic manipulation.
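
To illustrate why a wrong stereo match produces an unreliable depth value, the sketch below applies the standard rectified-stereo relation (depth = focal length × baseline / disparity). It is not taken from the paper; the function name and the camera parameters (600 px focal length, 0.1 m baseline) are assumptions chosen only for illustration.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity (pixels) to metric depth for a rectified stereo pair."""
    if disparity_px <= 0:
        # A non-positive disparity has no valid finite depth.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters, assumed for this example only.
FOCAL_PX = 600.0
BASELINE_M = 0.1

# Disparity a matcher would report if it locked onto the glass surface.
surface_depth = disparity_to_depth(60.0, FOCAL_PX, BASELINE_M)

# Disparity reported if the matcher instead locks onto the background
# seen through the glass: a smaller disparity, hence a larger depth.
background_depth = disparity_to_depth(40.0, FOCAL_PX, BASELINE_M)

depth_error_m = background_depth - surface_depth
```

With these assumed numbers, a 20-pixel disparity error moves the estimated surface by roughly half a metre, which is more than enough to make a grasp miss.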

Approach

The method uses a cascaded vision transformer to extract structural cues and a lightweight GRU-based post-fusion module to combine them with appearance features, trained on a physically realistic synthetic dataset to bridge the simulation-to-reality gap.

Key results

Outperforms state-of-the-art stereo matching methods in disparity accuracy on transparent objects
Increases real-world transparent object grasp success rate by at least 18%
Introduces SynClearDepth, a photo-realistic dataset with 14,091 stereo pairs and precise depth labels
Demonstrates strong Sim2Real generalization for cluttered indoor robotic manipulation
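
As a rough illustration of how disparity accuracy on transparent objects can be measured, the sketch below computes a mean absolute disparity error (often called end-point error) restricted to a transparent-object mask. This summary does not state which metrics the paper reports, so the choice of metric, the function name `masked_epe`, and the toy values are all assumptions.

```python
def masked_epe(pred, gt, mask):
    """Mean absolute disparity error over pixels where mask is True.

    pred, gt: 2D lists of disparities in pixels.
    mask:     2D list of booleans marking transparent-object pixels.
    """
    total, count = 0.0, 0
    for p_row, g_row, m_row in zip(pred, gt, mask):
        for p, g, m in zip(p_row, g_row, m_row):
            if m:
                total += abs(p - g)
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


# Toy 2x2 disparity maps (values are made up for illustration).
pred = [[10.0, 12.0], [8.0, 9.0]]
gt = [[10.0, 10.0], [10.0, 9.0]]
transparent = [[False, True], [True, False]]

epe = masked_epe(pred, gt, transparent)
```

Restricting the average to the mask keeps well-matched opaque background pixels from hiding errors on the transparent regions.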

Why it matters

Provides a scalable, accurate perception pipeline that enables service and logistics robots to reliably handle transparent items in real-world environments.

Abstract

Transparent object depth perception remains a major challenge in robotics and logistics due to the limitations of standard 3D sensors in capturing accurate depth on transparent and reflective surfaces. This affects applications relying on depth maps and point clouds, particularly in robotic manipulation. To address this, we propose ClearDepth, a vision transformer-based algorithm for stereo depth recovery of transparent objects, enhanced by a novel feature post-fusion module that refines depth estimation using structural visual features. To mitigate the high costs of stereo dataset collection, we introduce a physically realistic, domain-adaptive Sim2Real framework for efficient data generation. Our method outperforms state-of-the-art stereo matching approaches on transparent depth recovery. Furthermore, in transparent object grasping experiments, ClearDepth improves transparent-scene perception and achieves at least an 18% higher grasp success rate compared to the state-of-the-art methods for transparent object manipulation. Our method demonstrates strong Sim2Real generalization, enabling precise depth perception of transparent objects for robotic applications in the real world. Dataset and project details are available at https://sites.google.com/view/cleardepth/.

Index terms

Deep Learning for Visual Perception; Computer Vision for Automation; Data Sets for Robotic Vision