ICRA 2026

Flow before Imitation: Learning Dexterous In-Hand Manipulation with Dynamic Visuotactile Shortcut Policy

Yijin Chen, Wenqiang Xu, Zhenjun Yu, Tutian Tang, Yutong Li, Siqiong Yao, Cewu Lu

AI summary

FBI dynamically fuses tactile and visual cues through object motion flow to enable real-time dexterous in-hand manipulation, outperforming baselines both in simulation and in the real world.
Dexterous Manipulation, Visuotactile Learning, Imitation Learning, Diffusion Policy, Sim-to-Real Transfer, Tactile Sensing

Problem

Dexterous in-hand manipulation faces challenges from complex contact dynamics and partial observability, while existing visuotactile methods rely on static fusion or bulky sensors that hinder real-world adaptability.

Approach

FBI extracts tactile information from temporal object motion flow using a dynamics-aware latent model, dynamically fuses it with visual inputs via a transformer, and trains a one-step shortcut diffusion policy for real-time execution.
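
The one-step execution described above can be illustrated with a toy sampler. This is a minimal sketch, not the paper's implementation: `toy_shortcut`, `sample_action`, and the closed-form field are assumptions made for illustration. A trained policy would replace `toy_shortcut` with a network conditioned on the fused observation features, the time `t`, and the step size `d`.

```python
def toy_shortcut(x, t, d, target):
    """Hypothetical stand-in for a learned shortcut network s(x, t, d).

    Assumes a single target action and straight noise-to-action paths,
    so the average velocity from time t onward is (target - x) / (1 - t).
    In this toy case the value does not depend on d; a learned network
    would take d as an input.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(noise, target, num_steps):
    """Move from noise (t = 0) to an action (t = 1) in num_steps jumps."""
    d = 1.0 / num_steps          # step size; d = 1.0 means one-step sampling
    x = list(noise)
    for k in range(num_steps):
        t = k * d
        s = toy_shortcut(x, t, d, target)
        x = [xi + d * si for xi, si in zip(x, s)]
    return x


noise = [0.3, -1.2, 0.8]         # illustrative initial noise sample
target = [1.0, 0.0, -0.5]        # illustrative expert action
one_step = sample_action(noise, target, num_steps=1)
four_step = sample_action(noise, target, num_steps=4)
```

In this idealized setting one jump and four jumps reach the same action, which is the property a shortcut model is trained to approximate; with a learned network the single-jump result trades some accuracy for a single forward pass per control step.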

Key results

  • Dynamic visuotactile fusion supports two operating modes: vision-only and visuotactile
  • Achieves 64.7% to 66.5% average simulation success, surpassing prior SOTA by up to 18.4%
  • Delivers 33.5% to 35.0% real-world success rates across in-hand and Adroit benchmark tasks
  • Flow2Tactile module predicts dense contact states from point cloud flow with 85.5% accuracy
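
To make the last result concrete, the sketch below shows one way point cloud flow and a contact-state accuracy could be computed. It is an assumption-laden illustration, not the Flow2Tactile module: `point_flow` and `contact_accuracy` are hypothetical helpers, the points are assumed to be in one-to-one correspondence across frames, and the sample values are made up for the example.

```python
def point_flow(prev_points, next_points):
    """Per-point displacement between two frames of corresponding 3D points."""
    return [
        tuple(n - p for p, n in zip(prev_pt, next_pt))
        for prev_pt, next_pt in zip(prev_points, next_points)
    ]


def contact_accuracy(predicted, actual):
    """Fraction of points whose predicted binary contact state matches the label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Two tracked points; only the first one moves between frames.
prev_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
next_frame = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.0)]
flow = point_flow(prev_frame, next_frame)

# Illustrative contact predictions against labels for four points.
accuracy = contact_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Here `flow` holds one displacement vector per point, and `accuracy` is the share of points classified correctly, the same kind of quantity as the 85.5% figure reported above.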

Why it matters

It enables robust, real-time dexterous manipulation in sensor-limited environments, advancing practical deployment of robotic hands for complex manipulation tasks.

Abstract

Dexterous in-hand manipulation remains a long-standing challenge in robotics, primarily due to the complex contact dynamics and partial observability. While humans synergize vision and touch for such tasks, robotic approaches often prioritize one modality, therefore limiting adaptability. This paper introduces Flow Before Imitation (FBI), a visuotactile imitation learning framework that dynamically fuses tactile interactions with visual observations through motion dynamics. Unlike prior static fusion methods, FBI establishes a causal link between tactile signals and object motion via a dynamics-aware latent model. FBI employs a transformer-based interaction module to fuse flow-derived tactile features with visual inputs, training a one-step diffusion policy for real-time execution. Extensive experiments demonstrate that the proposed method outperforms the baseline methods in both simulation and the real world on two customized in-hand manipulation tasks and three standard dexterous manipulation tasks. Code, models, and more results are available on the website https://sites.google.com/view/dex-fbi.

Index terms

Imitation Learning, In-Hand Manipulation, Machine Learning for Robot Control
