ICRA 2026

SilRef: Joint Visual Silhouette and Tactile Pose Optimization for Transparent Object Manipulation

Saifeddine Aloui, Mathieu Grossard, Markus Vincze and Andreas Holzinger

AI summary

SilRef achieves robust 6D pose refinement for transparent objects by fusing silhouette matching with tactile or surface cues, outperforming depth-based and render-and-compare methods without requiring accurate depth or complex rendering.

transparent object pose estimation, silhouette optimization, tactile-visual fusion, pose refinement, laboratory automation, gradient descent

Problem

Transparent objects break standard vision pipelines: they lack distinct features, distort the background seen through them, and defeat depth sensors. Laboratory automation depends on reliable pose estimation, which current methods find nearly impossible for such objects.

Approach

SilRef iteratively refines object poses via gradient descent by aligning detected silhouette rays with rendered 3D model points, supplemented by tactile contact constraints or supporting surface geometry to resolve depth ambiguities.
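
As a rough illustration of the idea, the sketch below refines only the translation of an object by gradient descent, pulling model points onto camera rays through silhouette pixels and adding a supporting-plane term to condition the depth. It is a toy under stated assumptions, not the SilRef formulation: rotation is held fixed, point-to-ray correspondences are given and never recomputed, the gradient is written by hand, and all names (`refine_translation`, `plane_n`, `plane_h`, `weight`) are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refine_translation(model_pts, rays, base_pt, plane_n, plane_h, t0,
                       weight=None, iters=300):
    """Gradient descent on a translation t (rotation held fixed; assumption).

    Loss = sum_i ||(I - d_i d_i^T)(m_i + t)||^2 + weight * (n.(b + t) - h)^2
    where d_i is the unit ray through the camera origin paired with model
    point m_i, and the second term keeps the base point b on the plane.
    """
    n_pts = len(model_pts)
    w = float(n_pts) if weight is None else weight
    # Step size 1/L, with L = 2 * (n_pts + w) an upper bound on the
    # largest eigenvalue of the (constant) Hessian of this quadratic loss.
    lr = 1.0 / (2.0 * (n_pts + w))
    t = list(t0)
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for m, d in zip(model_pts, rays):
            p = [m[k] + t[k] for k in range(3)]
            along = dot(p, d)  # length of p projected onto the ray
            for k in range(3):
                # Gradient of the squared point-to-ray distance.
                grad[k] += 2.0 * (p[k] - along * d[k])
        gap = dot(plane_n, [base_pt[k] + t[k] for k in range(3)]) - plane_h
        for k in range(3):
            grad[k] += 2.0 * w * gap * plane_n[k]
        t = [t[k] - lr * grad[k] for k in range(3)]
    return t

# Synthetic example: a flat rectangular outline 0.5 m in front of the camera.
model = [[-0.03, -0.05, 0.0], [0.03, -0.05, 0.0],
         [0.03, 0.05, 0.0], [-0.03, 0.05, 0.0]]
t_true = [0.02, 0.01, 0.50]
rays = [normalize([m[k] + t_true[k] for k in range(3)]) for m in model]
base = [0.0, 0.05, 0.0]        # point of the model resting on the surface
plane_n = [0.0, 0.6, 0.8]      # unit normal of the supporting plane
plane_h = dot(plane_n, [base[k] + t_true[k] for k in range(3)])

t_est = refine_translation(model, rays, base, plane_n, plane_h,
                           t0=[0.0, 0.0, 0.40])
print(t_est)
```

Passing `weight=0.0` drops the plane term. Depth is then constrained only by the small angular spread of the rays, so convergence along the viewing direction is much slower, which is one way to see why an extra geometric cue helps.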

Key results

2.8x improvement in ADD-S@0.01m for objects standing on a surface (Keypose dataset)
2.7x improvement in ADD-S@0.01m for grasped objects (new Tracebot In-Gripper dataset)
Training-free optimization that eliminates the need for depth sensors or realistic transparent rendering
Release of the Tracebot In-Gripper dataset containing 608 grasped transparent object images with tactile sensor data

Why it matters

Enables reliable automation of liquid-handling and small-batch medication manufacturing in laboratory settings where transparent containers are ubiquitous.

Abstract

Transparent objects are ubiquitous in laboratory automation settings, as liquids need to be visually controlled regularly. Automating laboratory processes would make the creation of small-batch medication feasible, thus making more personalized and better-targeted treatments more accessible. However, transparent objects present a major challenge for robust vision systems, in turn compromising their manipulation. Their appearance varies depending on the environment, and depth sensors fail to capture their measurements. These objects therefore break central assumptions made by depth-based as well as render-and-compare pose refinement strategies. To ensure reliable pose estimation, we propose Silhouette-based object pose Refinement (SilRef), a novel pose refinement approach that leverages object silhouette detection and geometric cues, circumventing the need for depth maps or realistic rendering and making it robust to environment changes. Our proposed formulation directly optimizes the poses by gradient descent based on 3D model rendering and benefits from a large convergence basin. SilRef is evaluated on the Keypose dataset and the newly collected Tracebot In-Gripper dataset. Results show an improvement of 2.8x and 2.7x in Average Distance of Model Points-Symmetric (ADD-S@0.01m) when the object is standing on a surface and when the object is already grasped, respectively, compared to Megapose6D and ICP (Iterative Closest Point).
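
For readers unfamiliar with the metric, the sketch below computes ADD-S as it is commonly defined: the mean distance from each model point under the ground-truth pose to the closest model point under the estimated pose. Reading "ADD-S@0.01m" as the fraction of estimates whose ADD-S is below 0.01 m is an assumption here, as the abstract does not spell it out. The helper names (`transform`, `add_s`, `recall_at`) are hypothetical.

```python
import math

def transform(R, t, p):
    """Apply a rotation matrix R (3x3 nested lists) and translation t to p."""
    return [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]

def add_s(model_pts, R_gt, t_gt, R_est, t_est):
    """Mean closest-point distance between the two posed point sets."""
    gt = [transform(R_gt, t_gt, p) for p in model_pts]
    est = [transform(R_est, t_est, p) for p in model_pts]
    return sum(min(math.dist(g, e) for e in est) for g in gt) / len(gt)

def recall_at(errors, threshold=0.01):
    """Fraction of errors below the threshold in metres (assumed reading)."""
    return sum(e < threshold for e in errors) / len(errors)

# Four points on a ring: symmetric under a 90 degree rotation about z.
ring = [[0.02, 0.0, 0.0], [0.0, 0.02, 0.0],
        [-0.02, 0.0, 0.0], [0.0, -0.02, 0.0]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]

# The estimate is rotated by 90 degrees and shifted 5 mm along z.
err = add_s(ring, identity, [0.0, 0.0, 0.0], rot_z_90, [0.0, 0.0, 0.005])
print(err, recall_at([err, 0.02]))
```

Because the closest point is used rather than the same-index point, the 90 degree rotation costs nothing for this symmetric shape and only the 5 mm offset contributes to the error.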

Index terms

Perception for Grasping and Manipulation, Deep Learning for Visual Perception, Deep Learning in Grasping and Manipulation