ICRA 2026

ViSA-Flow: Accelerating Robot Skill Learning Via Large-Scale Video Semantic Action Flow

Changhe Chen, Quantao Yang, Xiaohao Xu, Nima Fazeli, Olov Andersson

AI summary

ViSA-Flow enables robots to learn complex manipulation skills efficiently from large-scale human videos by extracting and transferring semantic action flows, significantly outperforming baselines in low-data regimes.

Robot imitation learning; semantic action flow; video-based learning; few-shot adaptation; CALVIN benchmark; cross-domain transfer

Problem

Collecting large-scale, high-quality robot demonstrations is prohibitively expensive and limits the scalability of robot imitation learning, while existing video-based methods often rely on low-level motion flow that misses higher-level semantic cues humans naturally use.

Approach

The framework extracts weakly supervised semantic action flows (hand-object interaction masks amplified by temporal tracking) from unlabeled human videos to pre-train a generative policy, then fine-tunes it on a small set of robot demonstrations for efficient cross-domain skill transfer.
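As a rough illustration of the extraction step described above, the sketch below spreads per-frame hand-object masks across neighbouring frames and then dims everything outside them. It is not the paper's pipeline: `propagate_masks`, `emphasize_interaction`, `window`, and `background_scale` are hypothetical names, the temporal union is only a crude stand-in for a real tracker, and the masks are assumed to come from some upstream hand-object segmenter.

```python
import numpy as np


def propagate_masks(masks, window=1):
    """Union each mask with its temporal neighbours.

    Assumption: this replaces a real tracker with a simple sliding-window
    union so the example stays self-contained.
    masks: bool array of shape (T, H, W).
    """
    masks = np.asarray(masks, dtype=bool)
    num_frames = masks.shape[0]
    propagated = np.zeros_like(masks)
    for t in range(num_frames):
        start = max(0, t - window)
        stop = min(num_frames, t + window + 1)
        propagated[t] = masks[start:stop].any(axis=0)
    return propagated


def emphasize_interaction(frames, masks, background_scale=0.3):
    """Keep masked pixels unchanged and dim the rest of each frame.

    frames: float array of shape (T, H, W, C) with values in [0, 1].
    masks:  bool array of shape (T, H, W), True on hand/object pixels.
    """
    frames = np.asarray(frames, dtype=np.float32)
    masks = np.asarray(masks, dtype=bool)
    if frames.shape[:3] != masks.shape:
        raise ValueError("frames must be (T, H, W, C) and masks (T, H, W)")
    weights = np.where(masks, 1.0, background_scale).astype(np.float32)
    return frames * weights[..., None]


# Toy clip: three 2x2 RGB frames, with one interaction pixel in the middle frame.
frames = np.full((3, 2, 2, 3), 0.4, dtype=np.float32)
masks = np.zeros((3, 2, 2), dtype=bool)
masks[1, 0, 0] = True

tracked = propagate_masks(masks, window=1)
flow = emphasize_interaction(frames, tracked)
```

Because the same two functions can be applied to human and robot clips alike, both domains end up in one shared representation, which is the property the approach relies on for transfer.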

Key results

Pre-trains a generative policy on large-scale human video semantic action flows
Refines the policy via few-shot robot demonstrations with robust semantic alignment
Achieves state-of-the-art performance on the CALVIN benchmark and real-world tasks
Demonstrates effective zero-shot generalization across unseen environments and novel objects
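The pre-train-then-fine-tune schedule in the first two items can be illustrated with a toy linear model. This is only a sketch of the training schedule, not of ViSA-Flow's generative policy: the data are synthetic, the assumption that the robot mapping is a scaled copy of the human mapping is made up for the example, and `gradient_descent` and `mean_squared_error` are hypothetical helpers.

```python
import numpy as np


def mean_squared_error(weights, inputs, targets):
    """Mean over samples of the squared prediction error."""
    residual = inputs @ weights - targets
    return float(np.mean(np.sum(residual ** 2, axis=1)))


def gradient_descent(weights, inputs, targets, steps, lr):
    """Plain gradient descent on mean_squared_error."""
    for _ in range(steps):
        residual = inputs @ weights - targets
        grad = 2.0 * inputs.T @ residual / len(inputs)
        weights = weights - lr * grad
    return weights


rng = np.random.default_rng(0)

# Assumption: the robot mapping is close to the human one (here a scaled copy).
human_w = rng.normal(size=(4, 2))
robot_w = 1.1 * human_w

# Large "human video" set and a small "robot demonstration" set.
human_x = rng.normal(size=(1000, 4))
human_y = human_x @ human_w
robot_x = rng.normal(size=(10, 4))
robot_y = robot_x @ robot_w

init = np.zeros((4, 2))

# Stage 1: pre-train on the large human dataset.
pretrained = gradient_descent(init, human_x, human_y, steps=500, lr=0.1)

# Stage 2: fine-tune briefly on the few robot samples.
finetuned = gradient_descent(pretrained, robot_x, robot_y, steps=20, lr=0.05)

# Baseline: the same short budget on robot data, starting from scratch.
scratch = gradient_descent(init, robot_x, robot_y, steps=20, lr=0.05)

finetuned_loss = mean_squared_error(finetuned, robot_x, robot_y)
scratch_loss = mean_squared_error(scratch, robot_x, robot_y)
```

In this toy setting the fine-tuned model starts much closer to the robot mapping than the from-scratch baseline, so the same short training budget leaves it with a lower loss on the robot data.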

Why it matters

It provides a scalable, data-efficient pathway for robots to acquire complex manipulation skills by leveraging abundant internet video data, reducing reliance on costly robot demonstrations.

Abstract

One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.

Index terms

Imitation Learning; Learning from Demonstration; Deep Learning in Grasping and Manipulation