ICRA 2026

M-VTOP: Modular Visuo-Tactile Object Pose Estimation for High-Precision Robotic Manipulation

Miquel Oller, Qiyang Qian, Radu Corcodel, Siddarth Jain

AI summary

M-VTOP achieves sub-millimeter in-hand object pose estimation under complex geometries and occlusions by flexibly fusing vision and tactile data without requiring retraining.

Visuo-tactile sensing, Object pose estimation, Particle filter, Zero-shot learning, Robotic manipulation, CAD-based alignment

Problem

Current pose estimation methods handle occlusion and sensor noise poorly and generalize weakly, particularly for small or intricate objects, and they often demand extensive retraining or large annotated datasets.

Approach

The framework fuses vision, tactile, and contact signals using a belief-based particle filter and a geometry-focused mask representation, enabling robust, zero-shot pose refinement that relies solely on an object's CAD model.
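As a rough illustration of the fusion idea, the sketch below shows one way a particle filter can combine several modalities while tolerating a missing one. It is not the authors' implementation: the three-component pose (x, y, yaw), the Gaussian measurement model, the noise levels, the jitter step, and all function names (`gaussian_loglik`, `fuse_and_resample`, `run_demo`) are assumptions made for the example.

```python
import numpy as np


def gaussian_loglik(sigma):
    """Log-likelihood of a direct, noisy pose measurement (an assumed sensor model)."""
    def loglik(particles, z):
        return -0.5 * np.sum((particles - z) ** 2, axis=1) / sigma**2
    return loglik


def fuse_and_resample(particles, weights, observations, likelihoods, rng):
    """Reweight particles by every available modality, then resample.

    observations: dict of modality name -> measurement, or None if missing.
    likelihoods:  dict of modality name -> function(particles, z) -> log-likelihoods.
    """
    log_w = np.log(weights + 1e-300)
    for name, z in observations.items():
        if z is None:  # a missing modality simply contributes no evidence
            continue
        log_w += likelihoods[name](particles, z)
    log_w -= log_w.max()  # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum()

    # Systematic resampling: one random offset, evenly spaced positions.
    n = len(w)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(w), positions)
    idx = np.minimum(idx, n - 1)  # guard against floating-point round-off
    return particles[idx], np.full(n, 1.0 / n)


def run_demo(seed=0, n_particles=5000, n_steps=10):
    rng = np.random.default_rng(seed)
    true_pose = np.array([0.3, -0.2, 0.1])      # hypothetical (x, y, yaw)
    sigmas = {"vision": 0.3, "tactile": 0.1}    # assumed noise levels
    likelihoods = {k: gaussian_loglik(s) for k, s in sigmas.items()}

    particles = rng.normal(0.0, 0.5, size=(n_particles, 3))  # broad initial belief
    weights = np.full(n_particles, 1.0 / n_particles)

    for step in range(n_steps):
        obs = {k: true_pose + rng.normal(0.0, s, size=3) for k, s in sigmas.items()}
        if 3 <= step <= 5:
            obs["vision"] = None  # simulate an occluded camera
        particles, weights = fuse_and_resample(
            particles, weights, obs, likelihoods, rng
        )
        # Small jitter keeps particle diversity after resampling.
        particles = particles + rng.normal(0.0, 0.02, size=particles.shape)

    # A plain mean is adequate here only because the yaw spread is small.
    return particles.mean(axis=0), true_pose


if __name__ == "__main__":
    estimate, truth = run_demo()
    print("estimate:", estimate, "error:", np.linalg.norm(estimate - truth))
```

Summing log-likelihoods treats the modalities as conditionally independent given the pose, which is what lets a modality drop out without any change to the rest of the update.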

Key results

Sub-millimeter pose accuracy across diverse small objects and complex geometries
Robust performance under occlusions, sensor noise, and missing modalities
Zero-shot operation requiring only CAD models without task-specific retraining
Up to 100% insertion success rate in simulation and real-world trials

Why it matters

Provides a reliable, retraining-free solution for high-precision robotic assembly and manipulation of small, complex components in industrial and service applications.

Abstract

Accurate object pose estimation is essential for robotic manipulation, particularly in tasks involving small or geometrically intricate objects where high precision is required. Existing vision, tactile, and hybrid-based approaches struggle with occlusion, noise, and limited generalization, often requiring extensive retraining or large annotated datasets. In this work, we present M-VTOP, a modular framework for in-hand object pose estimation that integrates vision, tactile, and contact sensing in a flexible manner, allowing robustness against noisy or missing modalities. At the core of the framework is a belief-based particle filter that fuses heterogeneous sensor observations, maintains probabilistic estimates, and continuously refines them toward high-precision convergence in closed-loop robotic control with the pose estimation feedback. A mask-based observation representation unifies visual and tactile signals into geometry-centric inputs, enhancing robustness to texture and lighting variations while supporting zero-shot generalization. The framework requires only an object’s CAD model and avoids task-specific retraining. Experiments show that M-VTOP achieves sub-millimeter accuracy under complex geometries, occlusions, and tight tolerances, demonstrating its promise for high-precision robotic manipulation.
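To make the mask-based representation more concrete, the following sketch scores pose hypotheses by comparing a predicted binary mask against an observed one using intersection over union (IoU). This is an illustrative assumption, not the scoring function from the paper: the abstract does not say how masks are compared, and the square "renderer" (`square_mask`) is a hypothetical stand-in for rendering a silhouette from the object's CAD model. Because only binary masks are compared, texture and lighting do not enter the score.

```python
import numpy as np


def mask_iou(pred_mask, obs_mask):
    """Intersection over union of two binary masks; 0.0 when both are empty."""
    pred = np.asarray(pred_mask, dtype=bool)
    obs = np.asarray(obs_mask, dtype=bool)
    union = np.logical_or(pred, obs).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, obs).sum() / union)


def square_mask(size, top, left, side):
    """Hypothetical stand-in for rendering a CAD silhouette at a given pose."""
    mask = np.zeros((size, size), dtype=bool)
    mask[top:top + side, left:left + side] = True
    return mask


def score_hypotheses(obs_mask, hypotheses, size, side):
    """Score each (top, left) pose hypothesis by mask overlap with the observation."""
    return np.array(
        [mask_iou(square_mask(size, t, l, side), obs_mask) for t, l in hypotheses]
    )


if __name__ == "__main__":
    observed = square_mask(32, 10, 12, 8)          # observed silhouette
    candidates = [(10, 12), (10, 14), (20, 20)]    # candidate poses
    scores = score_hypotheses(observed, candidates, size=32, side=8)
    print("best hypothesis:", candidates[int(np.argmax(scores))])
```

Scores like these could serve as per-particle observation weights in a filter, with the same comparison applied to a tactile contact mask in place of the camera silhouette.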

Index terms

Computer Vision for Automation, Sensor-based Control, Assembly