ICRA 2026

PoCoDP3: Pose- and Contact-Aware Visual-Tactile Policy for Contact-Rich 3D Manipulation

Zhaokun Yue, Ling Tong, Kun Qian

AI summary

PoCoDP3 enhances contact-rich robotic manipulation by fusing 3D point clouds with pose- and contact-aware tactile sensing via an efficient reference-guided diffusion policy.

imitation learning, visual-tactile sensing, diffusion policy, 3D manipulation, contact-rich tasks

Problem

Vision-only policies struggle with occlusion and fine-grained interaction details in visually ambiguous regions, while existing visual-tactile methods often lack effective fusion strategies for 3D data and suffer from slow inference speeds.

Approach

The framework uses a dual-branch tactile encoder to estimate object pose and contact dynamics, a cross-modal fusion mechanism that adaptively prioritizes sensory inputs, and a reference-guided diffusion policy to accelerate action generation.
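To make the idea of adaptively prioritizing sensory inputs concrete, here is a minimal sketch of a contact-gated blend of two feature vectors. It is not the paper's fusion module, which this summary does not specify. The function names, the sigmoid gate, the scalar contact signal, and the `threshold` and `sharpness` values are all assumptions chosen for illustration.

```python
import math


def contact_gate(contact_signal, threshold=0.5, sharpness=8.0):
    """Map a scalar contact signal to a tactile weight in (0, 1).

    Hypothetical gate: a sigmoid centred on `threshold`. The summary does
    not say how PoCoDP3 derives its weighting, so this is an assumption.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (contact_signal - threshold)))


def fuse(visual_feat, tactile_feat, contact_signal):
    """Blend two equal-length feature vectors as a convex combination.

    A stronger contact signal shifts weight from vision to touch.
    """
    if len(visual_feat) != len(tactile_feat):
        raise ValueError("feature vectors must have the same length")
    w = contact_gate(contact_signal)
    return [(1.0 - w) * v + w * t for v, t in zip(visual_feat, tactile_feat)]


if __name__ == "__main__":
    visual = [1.0, 1.0]
    tactile = [0.0, 0.0]
    print(fuse(visual, tactile, contact_signal=0.0))  # mostly visual
    print(fuse(visual, tactile, contact_signal=2.0))  # mostly tactile
```

A learned system would more plausibly predict the weight from the tactile features themselves, possibly per channel; the scalar gate here only shows the direction of the effect.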

Key results

Outperforms representative 2D and 3D policies in accuracy
Reduces diffusion sampling steps to 2-5 while maintaining action quality
Enables structured tactile representations for precise in-hand interaction
Demonstrates superior inference efficiency across simulation and real-world tasks
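The reduction to a handful of sampling steps can be pictured with a toy refinement loop that starts near a reference action and refines only a small offset. This is a sketch of the general idea, not the paper's sampler: the summary does not give the noise schedule, the network, or how the reference is obtained. `toy_denoiser` is a hypothetical stand-in that simply returns a known target offset, and the step-size rule is an assumption.

```python
import random


def toy_denoiser(offset, target_offset):
    """Hypothetical stand-in for a learned network.

    It ignores the current offset and returns a known clean offset, which a
    real model would have to predict from observations.
    """
    return list(target_offset)


def sample_action(reference, target_offset, steps=3, noise_scale=0.1, seed=0):
    """Refine a noisy offset around `reference` in a few deterministic steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    rng = random.Random(seed)
    # Start from a small noisy offset instead of noise over the full action space.
    offset = [rng.gauss(0.0, noise_scale) for _ in reference]
    for k in range(steps):
        pred = toy_denoiser(offset, target_offset)
        alpha = 1.0 / (steps - k)  # fraction of the remaining gap; 1.0 on the last step
        offset = [o + alpha * (p - o) for o, p in zip(offset, pred)]
    # The final action is the reference plus the refined offset.
    return [r + o for r, o in zip(reference, offset)]


if __name__ == "__main__":
    print(sample_action([0.1, 0.2], [0.01, -0.02], steps=3))
```

Because the starting point is already close to a plausible action, the offset to be recovered is small, which is one intuition for why few steps might suffice.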

Why it matters

This approach allows robots to perform high-precision, contact-heavy tasks with real-time responsiveness even when visual data is occluded.

Abstract

Imitation learning in contact-rich tasks requires both global spatial awareness and fine-grained in-hand interaction understanding. However, vision-only policies based on images or point clouds are often susceptible to occlusion and struggle to capture critical contact details, particularly in visually ambiguous regions or during subtle tactile interactions. In this work, we present PoCoDP3, a pose- and contact-aware visual-tactile policy that integrates 3D point clouds and tactile inputs to generate actions in contact-rich tasks. PoCoDP3 introduces a dual-branch tactile encoder that jointly models contact dynamics and estimates in-hand object pose, enabling structured tactile representations for precise contact-rich manipulation. A contact-driven cross-modal fusion mechanism adaptively prioritizes sensory modalities based on real-time interaction cues, enabling efficient visual-tactile integration. Moreover, a reference-guided diffusion policy leverages reference action offsets to reduce sampling steps, significantly accelerating inference while maintaining action quality. Experiments across simulation and real-world tasks demonstrate that PoCoDP3 consistently outperforms representative 2D and 3D policies in terms of both accuracy and inference efficiency.

Index terms

Force and Tactile Sensing, Imitation Learning, Manipulation Planning