ICRA 2026

CRAFT: Adapting VLA Models to Contact-Rich Manipulation Via Force-Aware Curriculum Fine-Tuning

Yike Zhang, Yaonan Wang, Xinxin Sun, Kaizhen Huang, Zhiyuan Xu, Ji Junjie, Zhengping Che, Jian Tang, Kangcheng Liu, Jingtao Sun

AI summary

CRAFT enables robust contact-rich manipulation through a curriculum that makes VLA models prioritize force signals first, then gradually restores their access to visual and language information.

VLA models, contact-rich manipulation, force-aware learning, variational information bottleneck, curriculum fine-tuning

Problem

VLA models struggle with contact-rich tasks because they over-rely on high-entropy vision and language inputs, ignoring critical low-entropy force signals needed for precise alignment and stability.

Approach

A fine-tuning framework that uses a variational information bottleneck to compress vision and language embeddings during early training, so the model learns from force signals first; a curriculum schedule then gradually restores full multimodal access.
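
The bottleneck-plus-curriculum idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the loss form or the schedule, so the cosine decay, the `beta_max` and `warm_frac` parameters, the standard-normal prior, and every function name here are hypothetical. The sketch assumes the vision and language embedding is encoded as a diagonal Gaussian, penalizes its KL divergence from the prior with a weight that stays high early in training (pushing the policy toward force signals), and then decays that weight to zero so full multimodal access returns.

```python
import math
import random


def beta_schedule(step, total_steps, beta_max=1.0, warm_frac=0.3):
    """Hypothetical curriculum weight for the bottleneck penalty.

    Holds beta at beta_max for the first warm_frac of training, then
    decays it to 0 along a cosine ramp. The schedule shape is an assumption.
    """
    warm_steps = int(warm_frac * total_steps)
    if step < warm_steps:
        return beta_max
    if step >= total_steps:
        return 0.0
    progress = (step - warm_steps) / (total_steps - warm_steps)
    return beta_max * 0.5 * (1.0 + math.cos(math.pi * progress))


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def sample_bottleneck(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps of the compressed embedding."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def total_loss(action_loss, mu, log_var, step, total_steps):
    """Action loss plus the curriculum-weighted bottleneck penalty (assumed form)."""
    beta = beta_schedule(step, total_steps)
    return action_loss + beta * kl_to_standard_normal(mu, log_var)
```

With these assumed defaults, `beta_schedule(0, 100)` returns the full weight and `beta_schedule(100, 100)` returns zero, so the penalty on the perceptual embedding vanishes by the end of fine-tuning. In a real VLA fine-tuning loop, `mu` and `log_var` would come from a learned encoder over the vision and language embeddings, and the sampled `z` would replace those embeddings as input to the action head.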

Key results

Improved task success rates across five real-world contact-rich manipulation tasks
Enhanced generalization to unseen objects and novel task variations
Effective adaptation across diverse VLA architectures including RDT and π0
Development of a homologous leader-follower teleoperation system for synchronized data collection

Why it matters

It provides a model-agnostic method to equip general-purpose robotic agents with the ability to handle physically demanding, contact-heavy interactions.

Abstract

Vision–Language–Action (VLA) models have shown a strong capability in enabling robots to execute general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, before progressively restoring access to the full multimodal information. To enable force-aware learning, we further design a homologous leader–follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.

Index terms

AI-Enabled Robotics, Imitation Learning, Force Control