M-VTOP: Modular Visuo-Tactile Object Pose Estimation for High-Precision Robotic Manipulation
Miquel Oller, Qiyang Qian, Radu Corcodel, Siddarth Jain
AI summary
Problem
Current pose estimation methods struggle with occlusion and sensor noise and generalize poorly, particularly for small or intricate objects, and they often demand extensive retraining or large annotated datasets.
Approach
The framework fuses vision, tactile, and contact signals using a belief-based particle filter and a geometry-focused mask representation, enabling robust, zero-shot pose refinement that relies solely on an object's CAD model.
Key results
- Sub-millimeter pose accuracy across diverse small objects and complex geometries
- Robust performance under occlusions, sensor noise, and missing modalities
- Zero-shot operation requiring only CAD models without task-specific retraining
- Up to 100% insertion success rate in simulation and real-world trials
Why it matters
M-VTOP provides a reliable, retraining-free solution for high-precision robotic assembly and manipulation of small, complex components in industrial and service applications.
Abstract
Accurate object pose estimation is essential for robotic manipulation, particularly in tasks involving small or geometrically intricate objects where high precision is required. Existing vision-based, tactile-based, and hybrid approaches struggle with occlusion, noise, and limited generalization, often requiring extensive retraining or large annotated datasets. In this work, we present M-VTOP, a modular framework for in-hand object pose estimation that integrates vision, tactile, and contact sensing in a flexible manner, making it robust to noisy or missing modalities. At the core of the framework is a belief-based particle filter that fuses heterogeneous sensor observations, maintains probabilistic pose estimates, and continuously refines them toward high-precision convergence, with the pose estimate serving as feedback for closed-loop robotic control. A mask-based observation representation unifies visual and tactile signals into geometry-centric inputs, enhancing robustness to texture and lighting variations while supporting zero-shot generalization. The framework requires only an object's CAD model and avoids task-specific retraining. Experiments show that M-VTOP achieves sub-millimeter accuracy under complex geometries, occlusions, and tight tolerances, demonstrating its promise for high-precision robotic manipulation.
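
To make the fusion idea concrete, the following is a minimal sketch of a particle filter that combines whichever modalities report at a given step. It is not the authors' implementation. It assumes a 2-D position in millimeters rather than a full 6-DoF pose, and it uses isotropic Gaussian likelihoods in place of the mask-based observation model. All function names, noise levels, and the ground-truth pose are hypothetical, chosen only for illustration.

```python
import math
import random


def update_weights(particles, weights, observations):
    """Reweight particles with every modality that reported this step.

    observations maps a modality name to ((x, y), sigma) in millimeters,
    or to None when that modality is missing (e.g. an occluded camera).
    """
    log_w = []
    for (x, y), w in zip(particles, weights):
        lw = math.log(w) if w > 0.0 else -math.inf
        for obs in observations.values():
            if obs is None:
                continue  # missing modality: contributes no evidence
            (mx, my), sigma = obs
            residual = math.hypot(x - mx, y - my)
            lw += -0.5 * (residual / sigma) ** 2  # Gaussian log-likelihood
        log_w.append(lw)
    peak = max(log_w)  # subtract the peak for numerical stability
    unnorm = [math.exp(lw - peak) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def resample(particles, weights, rng, jitter):
    """Systematic resampling followed by small Gaussian jitter."""
    n = len(particles)
    offset = rng.random() / n
    resampled = []
    i, cumulative = 0, weights[0]
    for k in range(n):
        u = offset + k / n
        while u > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        x, y = particles[i]
        resampled.append((x + rng.gauss(0.0, jitter), y + rng.gauss(0.0, jitter)))
    return resampled, [1.0 / n] * n


def estimate(particles, weights):
    """Weighted mean of the belief."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    return x, y


def run_demo(seed=0, steps=30, n=1000):
    rng = random.Random(seed)
    true_pose = (1.0, -0.5)  # hypothetical ground truth, mm

    def noisy(sigma):
        return (true_pose[0] + rng.gauss(0.0, sigma),
                true_pose[1] + rng.gauss(0.0, sigma))

    particles = [(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)) for _ in range(n)]
    weights = [1.0 / n] * n
    pose = estimate(particles, weights)
    for step in range(steps):
        observations = {
            # Assumed noise levels: coarse vision, fine tactile sensing.
            "vision": None if step % 3 == 0 else (noisy(0.5), 0.5),
            "tactile": (noisy(0.1), 0.1),
        }
        weights = update_weights(particles, weights, observations)
        pose = estimate(particles, weights)
        particles, weights = resample(particles, weights, rng, jitter=0.05)
    return pose


if __name__ == "__main__":
    print(run_demo())
```

In this sketch a missing modality is simply skipped during the weight update, so the belief is refined by whatever evidence remains. That is one simple way to obtain the kind of robustness to dropped sensors that the abstract describes.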