ICRA 2026

ManipForce: Force-Guided Policy Learning with Frequency-Aware Representation for Contact-Rich Manipulation

Geonhyup Lee, Youngjin Lee, Kangmin Kim, Seongju Lee, Sangjun Noh, Seunghyeok Back, Kyoobin Lee


AI summary


Integrating high-frequency force-torque data with vision via a frequency-aware transformer raises average success on contact-rich manipulation tasks from 22% with vision-only baselines to 83%, a gain of about 61 percentage points.

contact-rich manipulation, force-torque sensing, multimodal learning, imitation learning, frequency-aware transformer, robotic assembly

Problem

Existing imitation learning methods for contact-rich manipulation rely mainly on vision-only demonstrations and therefore miss critical force cues, while collecting high-fidelity multimodal data remains difficult and costly.

Approach

The authors introduce a handheld system that captures natural human demonstrations with synchronized RGB and high-frequency force-torque signals, and propose the Frequency-Aware Multimodal Transformer (FMT), which encodes these asynchronous inputs with frequency- and modality-aware embeddings and fuses them via bi-directional cross-attention within a transformer diffusion policy.
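A minimal sketch of how frequency- and modality-aware embeddings can tag tokens from asynchronous streams, assuming a sinusoidal encoding of each sample's capture time, a second sinusoidal encoding of its stream's sampling rate, and a per-modality vector added to a linear projection of the raw features. This is an illustrative construction, not FMT's published architecture; the helper names (`sinusoidal`, `embed_stream`), the 64-dim width, the feature sizes, and the 30 Hz / 1 kHz rates are all hypothetical.

```python
import numpy as np

D_MODEL = 64  # hypothetical token width
rng = np.random.default_rng(0)

# Stand-ins for learned per-modality vectors (hypothetical; trained in a real model).
MODALITY_EMB = {"rgb": rng.normal(size=D_MODEL), "ft": rng.normal(size=D_MODEL)}


def sinusoidal(values: np.ndarray, dim: int = D_MODEL) -> np.ndarray:
    """Transformer-style sin/cos encoding of real-valued scalars, shape (N,) -> (N, dim)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = values[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def embed_stream(features, timestamps_s, rate_hz, modality, W):
    """Project raw samples to tokens and tag each with when, how often, and which sensor.

    features: (N, F) per-sample features; timestamps_s: (N,) capture times in seconds;
    rate_hz: nominal sampling rate of this stream; W: (F, D_MODEL) projection.
    """
    tokens = features @ W
    tokens = tokens + sinusoidal(timestamps_s * 1000.0)  # capture time, in ms
    tokens = tokens + sinusoidal(np.full(len(timestamps_s), float(rate_hz)))  # stream rate
    return tokens + MODALITY_EMB[modality]


# Same ~0.1 s window: 3 RGB frames (assumed 30 Hz) and 100 F/T samples (assumed 1 kHz).
t_rgb = np.arange(3) / 30.0
t_ft = np.arange(100) / 1000.0
rgb_tok = embed_stream(rng.normal(size=(3, 512)), t_rgb, 30.0, "rgb",
                       rng.normal(size=(512, D_MODEL)) / np.sqrt(512))
ft_tok = embed_stream(rng.normal(size=(100, 6)), t_ft, 1000.0, "ft",
                      rng.normal(size=(6, D_MODEL)) / np.sqrt(6))
print(rgb_tok.shape, ft_tok.shape)  # (3, 64) (100, 64)
```

Because the time component is computed from absolute timestamps rather than token position, an RGB frame and an F/T sample captured at the same instant receive the same time encoding even though the two streams produce very different numbers of tokens.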

Key results

83% average success rate across six real-world contact-rich tasks
Substantial performance gains over RGB-only baselines (22% average success)
Ablation and sampling-frequency analyses show that high-frequency F/T sensing and bi-directional cross-attention improve performance, especially in tasks requiring precise, stable contact
Direct real-world transfer of human demonstrations with sub-millimeter pose accuracy and gravity-compensated force measurements
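Gravity compensation of F/T readings is commonly done by subtracting the static weight of the tooling mounted past the sensor. The sketch below assumes the sensor's orientation in the world frame is known (for example from the device's pose tracking), the tooling mass and centre of mass have already been calibrated, and the sensor reports the wrench the tooling applies to it, expressed in the sensor frame. It is a generic technique, not necessarily ManipForce's procedure; the function name and example numbers are hypothetical, and sign conventions differ between sensors.

```python
import numpy as np

G_WORLD = np.array([0.0, 0.0, -9.81])  # gravity in the world frame (z up), m/s^2


def gravity_compensate(wrench_meas, R_world_sensor, mass, com, bias=None):
    """Subtract the static weight of the tooling mounted past an F/T sensor.

    wrench_meas: (6,) [fx, fy, fz, tx, ty, tz] in the sensor frame (N, N*m)
    R_world_sensor: (3, 3) rotation mapping sensor-frame vectors into the world frame
    mass: tooling mass past the sensor (kg)
    com: (3,) tooling centre of mass in the sensor frame (m)
    bias: optional (6,) zero-load offset, e.g. from calibration
    """
    w = np.asarray(wrench_meas, dtype=float)
    if bias is not None:
        w = w - bias
    g_sensor = R_world_sensor.T @ G_WORLD  # gravity expressed in the sensor frame
    f_grav = mass * g_sensor               # weight force acting on the sensor
    t_grav = np.cross(com, f_grav)         # moment of that weight about the sensor origin
    return np.concatenate([w[:3] - f_grav, w[3:] - t_grav])


# Hypothetical example: sensor rolled 90 deg about x, 0.5 kg tooling with its
# centre of mass 5 cm along the sensor z axis, plus a 2 N contact push along z.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
com = np.array([0.0, 0.0, 0.05])
meas = np.array([0.0, -4.905, 2.0, 0.24525, 0.0, 0.0])
contact = gravity_compensate(meas, R, 0.5, com)
print(np.round(contact, 3))  # approximately [0, 0, 2, 0, 0, 0]
```

Only the 2 N contact force survives: the -4.905 N weight along sensor y and the 0.245 N·m moment it creates about sensor x are removed.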

Why it matters

Provides a practical, open-source framework for capturing and learning from critical haptic feedback, advancing robust robotic assembly and precision manipulation.

Abstract

Contact-rich manipulation tasks such as precision assembly require precise control of interaction forces, yet existing imitation learning methods rely mainly on vision-only demonstrations. We propose ManipForce, a handheld system designed to capture high-frequency force–torque (F/T) and RGB data during natural human demonstrations for contact-rich manipulation. Building on these demonstrations, we introduce the Frequency-Aware Multimodal Transformer (FMT). FMT encodes asynchronous RGB and F/T signals using frequency- and modality-aware embeddings and fuses them via bi-directional cross-attention within a transformer diffusion policy. Through extensive experiments on six real-world contact-rich manipulation tasks (such as gear assembly, box flipping, and battery insertion), FMT trained on ManipForce demonstrations achieves robust performance with an average success rate of 83% across all tasks, substantially outperforming RGB-only baselines. Ablation and sampling-frequency analyses further confirm that incorporating high-frequency F/T data and cross-modal integration improves policy performance, especially in tasks demanding high precision and stable contact. Hardware, software, and video demos are available at: https://sites.google.com/view/manipforce/.

1 Department of AI Convergence, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Republic of Korea.
2 Department of AI Machinery, Korea Institute of Machinery & Materials (KIMM), Daejeon 34103, Republic of Korea.
† Corresponding author: Kyoobin Lee (kyoobinlee@gist.ac.kr)
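Bi-directional cross-attention between an RGB token set and an F/T token set can be sketched as follows, assuming a single attention head, residual connections, and no normalization or feed-forward layers. This is a generic illustration of cross-attention in both directions, not FMT's published layer; all names, sizes, and the random stand-in weights are hypothetical. In a full policy, the fused tokens would condition the transformer diffusion policy's denoiser, which is not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: each query token attends over `context`."""
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N_query, N_context)
    return softmax(scores, axis=-1) @ v      # (N_query, d)


def bidirectional_fusion(rgb, ft, params):
    """RGB tokens read from F/T tokens and F/T tokens read from RGB tokens (residual add)."""
    rgb_out = rgb + cross_attention(rgb, ft, *params["rgb_from_ft"])
    ft_out = ft + cross_attention(ft, rgb, *params["ft_from_rgb"])
    return rgb_out, ft_out


def make_params(rng, d):
    """Random (Wq, Wk, Wv) triple; stands in for learned weights."""
    return tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))


d = 64
rng = np.random.default_rng(0)
params = {"rgb_from_ft": make_params(rng, d), "ft_from_rgb": make_params(rng, d)}
rgb_tokens = rng.normal(size=(3, d))    # e.g. a few image tokens
ft_tokens = rng.normal(size=(100, d))   # e.g. many tokens from a denser F/T stream
rgb_fused, ft_fused = bidirectional_fusion(rgb_tokens, ft_tokens, params)
print(rgb_fused.shape, ft_fused.shape)  # (3, 64) (100, 64)
```

Running both directions lets the sparse visual tokens pick up contact cues from the dense F/T stream and lets each F/T token see visual context; a one-directional variant would keep only one of the two `cross_attention` calls.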

Index terms

Imitation Learning, Learning from Demonstration, Force and Tactile Sensing