ICRA 2026

Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning

Runze Tang, Penny Sweetser

AI summary

Predicting full-scene flow from human videos enables robots to generalize to new manipulation tasks with minimal demonstration data.

Imitation Learning, Cross-Embodiment, Scene Flow, Diffusion Policy, Few-Shot Learning, Robot Manipulation

Problem

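Collecting large-scale robot demonstrations for imitation learning is costly. Prior flow-based methods track only the object or specific points on the robot or hand, so they cannot describe the motion of the interaction, and they generalize poorly to scenarios observed only in human videos.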

Approach

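The method, SFCrP, has two parts. SFCr is a Transformer-based scene flow model trained on both human videos and robot data that predicts trajectories for arbitrary points in the scene. FCrP is a diffusion policy conditioned on this flow and on cropped point clouds, so it follows the overall flow motion while adjusting actions from observations where precision is needed.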
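
As a rough illustration of the conditioning inputs described above, the sketch below assembles per-step flow displacements and a cropped point cloud. It is not the authors' implementation: the function names, the spherical crop around a hypothetical gripper position, and the radius are all assumptions made for illustration, since the summary does not say how the crop is defined.

```python
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def crop_point_cloud(cloud: List[Point], center: Point, radius: float) -> List[Point]:
    """Keep only points within `radius` of `center`.

    Assumption: a spherical crop around the gripper. The paper may crop
    differently; this only illustrates discarding most of the scene.
    """
    return [p for p in cloud if dist(p, center) <= radius]


def flow_displacements(trajectory: List[Point]) -> List[Point]:
    """Convert one point's predicted trajectory into per-step displacements."""
    return [
        (b[0] - a[0], b[1] - a[1], b[2] - a[2])
        for a, b in zip(trajectory, trajectory[1:])
    ]


def build_condition(
    flow: List[List[Point]], cloud: List[Point], center: Point, radius: float
) -> Dict[str, list]:
    """Bundle the two conditioning signals a flow-conditioned policy would consume.

    `flow` holds one trajectory per query point (hypothetical layout).
    """
    return {
        "flow": [flow_displacements(t) for t in flow],
        "cloud": crop_point_cloud(cloud, center, radius),
    }


# Toy inputs: one tracked point moving along x, and three scene points.
flow = [[(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]]
cloud = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 1.0, 1.0)]
condition = build_condition(flow, cloud, center=(0.0, 0.0, 0.0), radius=0.1)
```

The intuition behind this split is that the flow carries the coarse, embodiment-agnostic motion, while a small local crop gives the policy enough geometry for precise adjustment without exposing the full scene it could overfit to.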

Key results

High cross-embodiment data efficiency in flow prediction with minimal robot demonstrations
Strong spatial and instance generalization to tasks only observed in human videos
Reduced diffusion policy overfitting through flow conditioning and point cloud cropping
Superior success rates over DP3, RISE, and SUGAR baselines across real-world tasks

Why it matters

Provides a scalable, low-cost pathway for robots to learn complex manipulation skills from abundant human video data.

Abstract

Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large amounts of demonstrations, creating significant collection costs. Prior work has investigated using flow as an intermediate representation to enable the use of human videos as a substitute, thereby reducing the amount of required robot demonstrations. However, most prior work has focused on the flow, either on the object or on specific points of the robot/hand, which cannot describe the motion of interaction. Meanwhile, relying on flow to achieve generalization to scenarios observed only in human videos remains limited, as flow alone cannot capture precise motion details. Furthermore, conditioning on scene observation to produce precise actions may cause the flow-conditioned policy to overfit to training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which includes a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts any point trajectories. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms SOTA baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.

Index terms

Imitation Learning, Learning from Demonstration, Machine Learning for Robot Control