ICRA 2026

Diffusion Trajectory-Guided Policy for Long-Horizon Robot Manipulation

Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, Min Wan

AI summary

A diffusion-based trajectory generator guides robot policies, boosting long-horizon manipulation success rates by 25% without external pretraining.

Imitation Learning, Diffusion Models, Long-horizon Manipulation, Vision-Language-Action, Robot Policy, CALVIN Benchmark

Problem

Imitation learning for long-horizon robotic tasks struggles with compounding errors and scarce demonstration data, leading to cascading failures and poor generalization.
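As a rough illustration of why compounding errors hurt long-horizon tasks, the sketch below assumes every control step succeeds independently with the same probability. That independence assumption, the function name, and the numbers are illustrative only and do not come from the paper.

```python
def rollout_success_probability(per_step_success: float, horizon: int) -> float:
    """Probability that all `horizon` steps succeed.

    Assumes steps fail independently with a fixed rate, which is an
    illustrative simplification rather than a model from the paper.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must lie in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return per_step_success ** horizon


if __name__ == "__main__":
    # Same per-step reliability, increasingly long rollouts.
    for horizon in (10, 100, 500):
        p = rollout_success_probability(0.99, horizon)
        print(f"horizon={horizon:4d}  success={p:.3f}")
```

Under these assumptions, a policy that is 99% reliable per step completes a 100-step rollout only about 37% of the time, which is the kind of cascading failure that trajectory-level guidance is meant to reduce.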

Approach

The framework has two stages. First, a vision-language diffusion model generates task-relevant 2D trajectories. Second, those trajectories guide the training of a robot manipulation policy, reducing error accumulation.
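The data flow between the two stages can be sketched in a toy form. Everything below is an assumption made for illustration and is not the authors' implementation: the sampler is a deterministic DDIM-style loop, `oracle_noise` is a stand-in for the trained vision-language noise predictor (it is simply handed the clean path), `alpha_bar` is a made-up noise schedule, and the "policy" is a hand-coded waypoint follower rather than a learned network. All function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def straight_line(start: Point, goal: Point, n: int) -> List[Point]:
    """Toy target path: n evenly spaced 2D waypoints from start to goal."""
    return [
        (start[0] + (goal[0] - start[0]) * i / (n - 1),
         start[1] + (goal[1] - start[1]) * i / (n - 1))
        for i in range(n)
    ]


def alpha_bar(t: int, num_steps: int) -> float:
    """Assumed cumulative noise schedule: 1.0 at t=0, shrinking as t grows."""
    return 1.0 - t / (num_steps + 1)


def oracle_noise(x_t: List[float], clean: List[float], t: int, num_steps: int) -> List[float]:
    """Stand-in for a trained noise predictor. It is given the clean path,
    which a real model would have to infer from the image and instruction."""
    a = alpha_bar(t, num_steps)
    return [(x - math.sqrt(a) * c) / math.sqrt(1.0 - a) for x, c in zip(x_t, clean)]


def sample_trajectory(clean: List[Point], num_steps: int = 10, seed: int = 0) -> List[Point]:
    """Stage 1 (sketch): DDIM-style deterministic sampling from Gaussian noise."""
    flat_clean = [v for p in clean for v in p]
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in flat_clean]
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps)
        eps = oracle_noise(x, flat_clean, t, num_steps)
        # Estimate the clean trajectory, then re-noise it to level t-1.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * h + math.sqrt(1.0 - a_prev) * e for h, e in zip(x0_hat, eps)]
    return [(x[i], x[i + 1]) for i in range(0, len(x), 2)]


def policy_action(pos: Point, waypoint: Point, max_step: float = 0.1) -> Point:
    """Stage 2 (sketch): hand-coded policy that steps toward the next waypoint."""
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    dist = math.hypot(dx, dy)
    scale = 1.0 if dist <= max_step else max_step / dist
    return (dx * scale, dy * scale)


def rollout(trajectory: List[Point], max_step: float = 0.1) -> Point:
    """Follow the guiding trajectory waypoint by waypoint; return the final position."""
    pos = trajectory[0]
    for waypoint in trajectory[1:]:
        for _ in range(1000):  # bounded loop as a safety net
            if math.hypot(waypoint[0] - pos[0], waypoint[1] - pos[1]) <= 1e-9:
                break
            action = policy_action(pos, waypoint, max_step)
            pos = (pos[0] + action[0], pos[1] + action[1])
    return pos


if __name__ == "__main__":
    target = straight_line((0.0, 0.0), (1.0, 0.5), n=5)
    guide = sample_trajectory(target)
    print("guide:", [(round(u, 3), round(v, 3)) for u, v in guide])
    print("final position:", rollout(guide))
```

Because the noise predictor here is an oracle, the sampled trajectory recovers the target path almost exactly. With a learned predictor the guide would only approximate it, and the policy would be trained with that guide as an additional input rather than following it by rule.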

Key results

25% higher average success rate than state-of-the-art baselines on the CALVIN benchmark
Trained from scratch without external pretraining
Significant real-world robot performance improvements
Computationally efficient training on consumer-grade GPUs

Why it matters

Enables reliable, data-efficient long-horizon robotic manipulation for real-world applications by bridging high-level language instructions with precise motor control.

Abstract

Recently, Vision-Language-Action (VLA) models have advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization, and current imitation learning methods struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is how to mitigate compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates 2D trajectories through a diffusion model to guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance to reduce error accumulation. Our two-stage approach first trains a generative vision-language model to create diffusion-based trajectories, then refines the imitation policy using them. Experiments on the CALVIN benchmark show that DTP outperforms state-of-the-art baselines by 25% in success rate, starting from scratch without external pretraining. Moreover, DTP significantly improves real-world robot performance. Our project is at diffusion-trajectory-guided-policy.github.io/.

Index terms

Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation