ICRA 2026

TORM: Transparent Objects Reconstruction and Manipulation with Multi-View Segmentation

Qiyuan Qiao, Fuling Lin, Huibin Zhao, Bowen Xu, Zhiqiang Chen, Dong Xu, Peng Lu

AI summary

TORM achieves an 88.8% real-world grasping success rate for multiple transparent objects by combining multi-view silhouette constraints with a novel deformable tetrahedral mesh optimization.

Transparent object reconstruction, multi-view segmentation, robotic grasping, deformable tetrahedral mesh, 3D perception, grasp planning

Problem

Accurately reconstructing and grasping multiple transparent objects is hindered by complex light refraction, lack of distinct visual features, and the tendency of existing methods to merge objects or get trapped in suboptimal geometric solutions.

Approach

The method extracts multi-view semantic silhouettes to constrain a self-supervised deep marching tetrahedra network, then applies a progressive loss and connectivity-based mesh separation to reconstruct individual objects and predict grasp poses in parallel.
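As a rough illustration of how multi-view silhouettes can constrain a shape fit, the sketch below scores rendered masks against segmentation masks with a soft IoU averaged over views. This is a generic mask loss assumed for illustration, not the loss defined in the paper; the names `mask_iou` and `multiview_silhouette_loss` are hypothetical, and masks are plain nested lists with values in [0, 1].

```python
def mask_iou(pred, target):
    """Soft IoU between two same-sized 2D masks with values in [0, 1]."""
    inter = 0.0
    total = 0.0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += p * t   # soft intersection
            total += p + t   # sum of both mask areas
    union = total - inter
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


def multiview_silhouette_loss(rendered, segmented):
    """Mean of (1 - IoU) over all views (hypothetical helper, not from the paper)."""
    if not rendered or len(rendered) != len(segmented):
        raise ValueError("need the same non-zero number of rendered and segmented views")
    losses = [1.0 - mask_iou(r, s) for r, s in zip(rendered, segmented)]
    return sum(losses) / len(losses)


# Two views: the first matches exactly, the second covers half of the target.
rendered = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
segmented = [[[1, 1], [0, 0]], [[1, 1], [0, 0]]]
loss = multiview_silhouette_loss(rendered, segmented)
```

In a real pipeline the rendered masks would come from a differentiable rasterizer so that the loss can drive the mesh parameters; that part is omitted here.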

Key results

Simultaneous multi-object reconstruction via DMTet-Multi
Progressive envelope loss keeps the optimization from getting trapped in local minima
Connectivity-based mesh separation for parallel grasp planning
88.8% real-world grasping success rate
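The connectivity-based separation step can be pictured as splitting one fitted triangle mesh into its connected components, so that each component can be passed to grasp planning on its own. The following is a minimal sketch using union-find over shared vertex indices; it assumes faces are given as vertex-index triples, and `split_connected_components` is a hypothetical name, not the paper's implementation.

```python
from collections import defaultdict


def split_connected_components(faces):
    """Group triangle faces into components that share vertices.

    faces: list of (i, j, k) vertex-index triples.
    Returns a list of components, each a list of face indices.
    """
    parent = {}

    def find(v):
        # Locate the root of v, shortening the path as we go.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # All three vertices of a face belong to the same component.
    for i, j, k in faces:
        union(i, j)
        union(j, k)

    # Bucket face indices by the root of their first vertex.
    groups = defaultdict(list)
    for idx, (i, _, _) in enumerate(faces):
        groups[find(i)].append(idx)
    return sorted(groups.values(), key=lambda g: g[0])


# Faces 0 and 1 share an edge; face 2 is a separate object.
components = split_connected_components([(0, 1, 2), (1, 2, 3), (4, 5, 6)])
```

Each returned component is independent of the others, which is what makes per-object grasp prediction straightforward to run in parallel.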

Why it matters

Enables reliable RGB-only perception and manipulation of transparent items, advancing robotic automation for laboratory, household, and AR applications.

Abstract

Transparent objects are common in daily life and industry, necessitating that robots be able to perceive and manipulate them. The physical properties of reflection and refraction pose challenges for accurately reconstructing the 3D geometry of transparent objects. Conventional methods, which rely on simultaneous estimation of background ambient light and complex refraction fields, lack robustness in real-world scenes, thereby impeding robotic grasping performance. To address this issue, this paper proposes TORM, a novel framework for robust reconstruction and manipulation of multiple transparent objects. TORM focuses on semantic information from transparent objects and employs multi-view segmentation masks to constrain a self-supervised multi-object deep marching tetrahedra (DMTet-Multi) 3D fitting process. To mitigate the risk of the geometry representation getting stuck in suboptimal solutions during multi-transparent-object reconstruction, we design a novel loss function that prevents marching tetrahedra from crossing boundaries. By applying a connectivity determination strategy to the fitted mesh, transparent objects can be processed in parallel by a grasp perception network, predicting the end-effector configuration for grasp tasks. Real-world experiments demonstrate that TORM achieves an 88.8% grasping success rate in multi-transparent-object grasping tasks.

Index terms

Perception for Grasping and Manipulation; Computer Vision for Automation; Deep Learning for Visual Perception