SilRef: Joint Visual Silhouette and Tactile Pose Optimization for Transparent Object Manipulation
Saifeddine Aloui, Mathieu Grossard, Markus Vincze and Andreas Holzinger
AI summary
Problem
Transparent objects break standard vision pipelines: they lack distinct features, distort the background, and defeat depth sensors. As a result, the reliable pose estimation that laboratory automation depends on is nearly impossible with current methods.
Approach
SilRef iteratively refines object poses via gradient descent by aligning detected silhouette rays with rendered 3D model points, supplemented by tactile contact constraints or supporting surface geometry to resolve depth ambiguities.
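To make the idea concrete, here is a minimal sketch of gradient descent on a point-to-ray alignment loss with one added contact constraint. It is not the authors' formulation: it optimizes translation only, assumes the correspondence between silhouette rays and model points is already known, and assumes all rays pass through the camera origin. The names `refine_translation`, `ray_residual`, `contact_idx`, `contact_pos`, and `w_contact` are hypothetical and chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def ray_residual(p, d):
    """Offset of point p from a ray through the camera origin with unit direction d."""
    s = dot(p, d)
    return [pi - s * di for pi, di in zip(p, d)]

def refine_translation(model_pts, ray_dirs, contact_idx, contact_pos,
                       steps=500, lr=0.1, w_contact=1.0):
    """Gradient descent on the translation t (assumed setup, translation only).

    Loss = mean_i |residual(m_i + t, d_i)|^2
           + w_contact * |m_c + t - contact_pos|^2
    """
    t = [0.0, 0.0, 0.0]
    n = len(model_pts)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        # Silhouette term: the gradient of |r|^2 w.r.t. t is 2 r, because
        # r = (I - d d^T)(m + t) and (I - d d^T) is a symmetric projection.
        for m, d in zip(model_pts, ray_dirs):
            p = [mi + ti for mi, ti in zip(m, t)]
            r = ray_residual(p, d)
            for k in range(3):
                grad[k] += 2.0 * r[k] / n
        # Contact term: pulls one model point toward a known 3D contact position.
        for k in range(3):
            c_k = model_pts[contact_idx][k] + t[k] - contact_pos[k]
            grad[k] += 2.0 * w_contact * c_k
        t = [ti - lr * gi for ti, gi in zip(t, grad)]
    return t

# Synthetic check: four model points about 0.5 m in front of the camera.
model_pts = [(0.02, 0.0, 0.5), (-0.02, 0.0, 0.5), (0.0, 0.05, 0.5), (0.0, -0.05, 0.5)]
t_true = (0.01, -0.02, 0.03)
moved = [[m[k] + t_true[k] for k in range(3)] for m in model_pts]
ray_dirs = [normalize(p) for p in moved]   # rays through the true point positions
contact_pos = moved[0]                     # assumed tactile contact on point 0
t_est = refine_translation(model_pts, ray_dirs, 0, contact_pos)
```

In this toy setup the rays are nearly parallel to the viewing axis, so the silhouette term alone constrains depth only weakly. The contact term is what pins the solution along that axis, which mirrors the role the summary attributes to tactile or supporting-surface constraints.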
Key results
- 2.8x improvement in pose accuracy (ADD-S@0.01m) for standing objects on the Keypose dataset
- 2.7x improvement in pose accuracy (ADD-S@0.01m) for grasped objects on the new Tracebot In-Gripper dataset
- Training-free optimization that eliminates the need for depth sensors or realistic transparent rendering
- Release of the Tracebot In-Gripper dataset containing 608 grasped transparent object images with tactile sensor data
Why it matters
Enables reliable automation of liquid-handling and small-batch medication manufacturing in laboratory settings where transparent containers are ubiquitous.
Abstract
Transparent objects are ubiquitous in laboratory automation settings, as liquids need to be visually inspected regularly. Automating laboratory processes would make the creation of small-batch medication feasible, thus making more personalized and better-targeted treatments more accessible. However, transparent objects present a major challenge for robust vision systems, which in turn compromises their manipulation. Their appearance varies with the environment, and depth sensors fail to measure them. These objects therefore break central assumptions made by depth-based as well as render-and-compare pose refinement strategies. To ensure reliable pose estimation, we propose Silhouette-based object pose Refinement (SilRef), a novel pose refinement approach that leverages object silhouette detection and geometric cues. It circumvents the need for depth maps or realistic rendering, which makes it robust to environment changes. Our proposed formulation directly optimizes the poses by gradient descent based on renderings of 3D models and benefits from a large convergence basin. SilRef is evaluated on the Keypose dataset and the newly collected Tracebot In-Gripper dataset. Results show improvements of 2.8x and 2.7x in Average Distance of Model Points-Symmetric (ADD-S@0.01m) when the object is standing on a surface and when the object is already grasped, respectively, compared to Megapose6D and ICP (Iterative Closest Point).
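For readers unfamiliar with the metric, the following sketch shows how ADD-S is commonly computed: the mean distance from each model point under the estimated pose to its nearest model point under the ground-truth pose. It assumes that ADD-S@0.01m means the fraction of poses whose ADD-S falls below 0.01 m, which is the usual reading but is not spelled out in the abstract. The function names `add_s` and `recall_at` are illustrative, and the brute-force nearest-neighbor search is only suitable for small point sets.

```python
import math

def add_s(pts_est, pts_gt):
    """Mean distance from each estimated-pose point to the closest ground-truth-pose point."""
    return sum(min(math.dist(p, q) for q in pts_gt) for p in pts_est) / len(pts_est)

def recall_at(add_s_values, threshold=0.01):
    """Fraction of poses with ADD-S below the threshold (assumed meaning of ADD-S@0.01m)."""
    return sum(v < threshold for v in add_s_values) / len(add_s_values)

# Four corners of a square, already transformed by the ground-truth pose (metres).
pts_gt = [(0.02, 0.02, 0.5), (-0.02, 0.02, 0.5), (-0.02, -0.02, 0.5), (0.02, -0.02, 0.5)]

# Estimate A: shifted 5 mm along x.
pts_shift = [(x + 0.005, y, z) for x, y, z in pts_gt]

# Estimate B: rotated 90 degrees about the optical axis. The square maps onto
# itself, so the symmetric metric does not penalize this pose.
pts_rot = [(-y, x, z) for x, y, z in pts_gt]

err_shift = add_s(pts_shift, pts_gt)
err_rot = add_s(pts_rot, pts_gt)
score = recall_at([err_shift, err_rot, 0.02])
```

The rotated estimate illustrates why a symmetric variant suits objects such as bottles and vials: poses that differ only by a rotation the geometry cannot distinguish receive the same error.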