ICRA 2026

Video-To-BT: Generating Reactive Behavior Trees from Human Demonstration Videos for Robotic Assembly

Xiwei Zhao, Yiwei Wang, Yansong Wu, Fan Wu, Teng Sun, Zhonghua Miao, Sami Haddadin, Alois Knoll

AI summary

A VLM-driven framework successfully converts human demonstration videos into reactive Behavior Trees, enabling robust and adaptive robotic assembly in dynamic environments.

Behavior Trees, Vision Language Models, Robotic Assembly, Human Demonstration, Reactive Control, Automated Planning

Problem

Traditional robotic assembly relies on rigid, expert-coded programs that lack flexibility and robustness to handle product variations and shop-floor disturbances. Existing automated BT generation methods still require substantial manual input or deterministic settings, limiting their practical use in dynamic, real-world tasks.

Approach

The authors propose Video-to-BT, a closed-loop framework that uses a Vision Language Model to decompose human demonstration videos into subtasks and automatically generate executable Behavior Trees. These trees are executed alongside real-time semantic perception, enabling reactive control and automatic replanning when disturbances occur.
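To make the reactive behavior concrete, here is a minimal sketch of how an ordered subtask list could be turned into a Behavior Tree. It is not the authors' implementation: the node classes, the `build_tree` helper, the subtask names, and the dictionary-based world state are all assumptions made for illustration. Each subtask becomes a Fallback of "goal already satisfied?" and "perform the action", and the tree is re-ticked from the root so that a goal undone by a disturbance is redone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    RUNNING = "running"


World = Dict[str, bool]  # hypothetical stand-in for semantic scene perception


@dataclass
class Condition:
    key: str

    def tick(self, world: World) -> Status:
        # Succeeds only if the perceived world state reports the goal as met.
        return Status.SUCCESS if world.get(self.key, False) else Status.FAILURE


@dataclass
class Action:
    name: str
    effect: str

    def tick(self, world: World) -> Status:
        # Stand-in for a robot skill: we simply assert the effect in the world.
        world[self.effect] = True
        return Status.RUNNING


@dataclass
class Sequence:
    children: list

    def tick(self, world: World) -> Status:
        # Tick children in order; stop at the first one that is not SUCCESS.
        for child in self.children:
            status = child.tick(world)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS


@dataclass
class Fallback:
    children: list

    def tick(self, world: World) -> Status:
        # Tick children in order; stop at the first one that is not FAILURE.
        for child in self.children:
            status = child.tick(world)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE


def build_tree(subtasks: List[Tuple[str, str]]) -> Sequence:
    """Build a tree from (action_name, goal_key) pairs (assumed format)."""
    return Sequence(
        [Fallback([Condition(goal), Action(name, goal)]) for name, goal in subtasks]
    )
```

Because every tick starts at the root, earlier goal conditions are rechecked each cycle. If `world["shaft_inserted"]` is reset to `False` after the task has completed, the next tick returns `RUNNING` and re-runs the corresponding action instead of reporting success.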

Key results

VLM-driven pipeline for automated BT generation from human videos
Closed-loop supervisory control integrating real-time semantic perception
High planning accuracy and robust long-horizon assembly completion
Strong generalization and disturbance recovery in dynamic environments

Why it matters

Enables non-experts to program flexible, robust robotic assembly systems for smart manufacturing without manual coding or extensive training data.

Abstract

Modern manufacturing demands robotic assembly systems with enhanced flexibility and reliability. However, traditional approaches often rely on programming tailored to each product by experts for fixed settings, which are inherently inflexible to product changes and lack the robustness to handle variations. As Behavior Trees (BTs) are increasingly used in robotics for their modularity and reactivity, we propose a novel hierarchical framework, Video-to-BT, that seamlessly integrates high-level cognitive planning with low-level reactive control, with BTs serving both as the structured output of planning and as the governing structure for execution. Our approach leverages a Vision Language Model (VLM) to decompose human demonstration videos into subtasks, from which BTs are generated. During the execution, the planned BTs combined with real-time scene interpretation enable the system to operate reactively in the dynamic environment, while VLM-driven replanning is triggered upon execution failure. This closed-loop architecture ensures stability and adaptivity. We validate our framework on real-world assembly tasks through a series of experiments, demonstrating high planning reliability, robust performance in long-horizon assembly tasks, and strong generalization across diverse and perturbed conditions. Project website: https://video2bt.github.io/video2bt page/
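The replanning loop described in the abstract can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's system: the planner is a stub standing in for the VLM, `skill` stands in for low-level execution, and the names `execute_plan`, `run_closed_loop`, `max_replans`, and the subtask strings are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Subtask = str
# Hypothetical planner interface: receives the failed subtask (or None on the
# first call) and returns a new ordered plan.
Planner = Callable[[Optional[Subtask]], List[Subtask]]
# Hypothetical skill interface: returns True if the subtask succeeded.
Skill = Callable[[Subtask], bool]


def execute_plan(plan: List[Subtask], skill: Skill) -> Optional[Subtask]:
    """Run subtasks in order; return the first failed subtask, or None."""
    for subtask in plan:
        if not skill(subtask):
            return subtask
    return None


def run_closed_loop(
    planner: Planner, skill: Skill, max_replans: int = 2
) -> Tuple[bool, int]:
    """Plan, execute, and replan on failure.

    Returns (succeeded, number_of_replans_used).
    """
    failed: Optional[Subtask] = None
    for attempt in range(max_replans + 1):
        plan = planner(failed)  # the failure context informs the new plan
        failed = execute_plan(plan, skill)
        if failed is None:
            return True, attempt
    return False, max_replans


def demo() -> Tuple[bool, int]:
    attempts = {"insert_gear": 0}

    def stub_planner(failed: Optional[Subtask]) -> List[Subtask]:
        if failed is None:
            return ["pick_gear", "insert_gear"]
        # On failure, prepend a made-up recovery step and retry the subtask.
        return ["realign_gear", failed]

    def flaky_skill(subtask: Subtask) -> bool:
        if subtask == "insert_gear":
            attempts["insert_gear"] += 1
            return attempts["insert_gear"] > 1  # fails once, then succeeds
        return True

    return run_closed_loop(stub_planner, flaky_skill)
```

In this sketch the bound `max_replans` keeps the loop from retrying indefinitely when a subtask cannot be recovered; whether and how the actual system bounds replanning is not stated in the abstract.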

Index terms

Task Planning, AI-Enabled Robotics, Intelligent and Flexible Manufacturing