ICRA 2026

Joint Flow Trajectory Optimization for Feasible Robot Motion Generation from Video Demonstrations

Xiaoxiang Dong, Matthew Johnson-Roberson, Weiming Zhi

AI summary

Jointly optimizing grasp feasibility and trajectory imitation via flow matching enables robots to execute video-based demonstrations more accurately and safely than sequential methods.

video-based learning, flow matching, grasp optimization, trajectory imitation, robot motion planning, SE(3) modeling

Problem

Directly tracking human hand motions from videos often violates robot joint constraints due to embodiment differences. This work addresses how to generate kinematically feasible grasp poses and object trajectories that consistently imitate human video demonstrations.

Approach

The proposed JFTO framework treats video demonstrations as object-centric guides and jointly optimizes a differentiable objective that balances grasp similarity, trajectory likelihood, and collision avoidance. It extends flow matching to SE(3) to model demonstrated object pose distributions probabilistically.
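As a rough illustration of how three such terms might be combined into one scalar cost, here is a minimal sketch. The function name `joint_objective`, the weights, and the exact form of each term (a grasp distance, a negative log-likelihood, and a squared penetration penalty) are assumptions made for this example; the summary does not specify the paper's actual formulation.

```python
import numpy as np

def joint_objective(grasp_dist, traj_log_lik, penetration_depths,
                    w_grasp=1.0, w_traj=1.0, w_coll=10.0):
    """Hypothetical combined cost (lower is better).

    grasp_dist         : scalar distance between the candidate grasp and the
                         demonstrated grasp (assumed similarity measure)
    traj_log_lik       : per-waypoint log-likelihoods of the object poses
                         under the learned trajectory model
    penetration_depths : signed depths per collision check; positive values
                         mean the robot penetrates an obstacle
    """
    # Grasp similarity: penalize deviation from the demonstrated grasp.
    grasp_term = w_grasp * grasp_dist
    # Trajectory likelihood: maximize likelihood by minimizing its negative.
    traj_term = -w_traj * np.sum(traj_log_lik)
    # Collision avoidance: smooth quadratic penalty on penetration only.
    coll_term = w_coll * np.sum(np.maximum(penetration_depths, 0.0) ** 2)
    return grasp_term + traj_term + coll_term

cost = joint_objective(0.5, np.array([-1.0, -2.0]), np.array([-0.1, 0.2]))
```

Because every term is a smooth (or piecewise-smooth) function of its inputs, the same structure could be written in an autodiff library and minimized with gradient-based optimization over the grasp pose and joint trajectory.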

Key results

Unified differentiable objective integrating grasp feasibility, trajectory imitation, and collision avoidance
Extension of flow matching to SE(3) for density-aware, multi-modal trajectory modeling
Higher imitation fidelity in simulation and real-world experiments compared to sequential baselines
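To make the idea of flow matching on SE(3) more concrete, the sketch below builds one training target: a pose interpolated between a noise sample and a demonstrated pose, together with the constant velocity a network would be regressed onto. It assumes SE(3) is treated as the product of SO(3) and R^3, with geodesic interpolation for rotation and linear interpolation for translation. That is one common construction and may differ from the paper's; all function names are hypothetical.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near identity
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def so3_log(R):
    """Rotation matrix -> rotation vector (assumes angle away from pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return 0.5 * v
    return theta / (2.0 * np.sin(theta)) * v

def se3_flow_target(R0, p0, R1, p1, t):
    """Interpolated pose at time t and the target velocity toward (R1, p1)."""
    omega = so3_log(R0.T @ R1)       # body-frame angular velocity (constant)
    R_t = R0 @ so3_exp(t * omega)    # geodesic on SO(3)
    p_t = (1.0 - t) * p0 + t * p1    # straight line in R^3
    return R_t, p_t, omega, p1 - p0

# Example: identity pose flowing to a 90-degree yaw with a 1 m shift in x.
R1 = so3_exp(np.array([0.0, 0.0, np.pi / 2]))
p1 = np.array([1.0, 0.0, 0.0])
R_half, p_half, omega, vel = se3_flow_target(np.eye(3), np.zeros(3), R1, p1, 0.5)
```

In a training loop, a network would take `(R_t, p_t, t)` as input and be trained to predict `(omega, vel)`; integrating the learned velocity field from noise then yields samples from the demonstrated pose distribution.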

Why it matters

Enables scalable, video-based robot skill transfer by bridging the embodiment gap and ensuring kinematically feasible, collision-free execution.

Abstract

Learning from human video demonstrations offers a scalable alternative to teleoperation or kinesthetic teaching, but poses challenges for robot manipulators due to embodiment differences and joint feasibility constraints. We address this problem by proposing the Joint Flow Trajectory Optimization (JFTO) framework for grasp pose generation and object trajectory imitation under the video-based Learning-from-Demonstration (LfD) paradigm. Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides, balancing three objectives: (i) selecting a feasible grasp pose, (ii) generating object trajectories consistent with demonstrated motions, and (iii) ensuring collision-free execution within robot kinematics. To capture the multimodal nature of demonstrations, we extend flow matching to SE(3) for probabilistic modeling of object trajectories, enabling density-aware imitation that avoids mode collapse. The resulting optimization integrates grasp similarity, trajectory likelihood, and collision penalties into a unified differentiable objective. We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.

Index terms

Learning from Demonstration, Probabilistic Inference