ICRA 2026

FlightDiffusion: Revolutionizing Autonomous Drone Training with Diffusion Model Generating FPV Video

Valerii Serpiva, Artem Lykov, Faryal Batool, Vladislav Kozlovskiy, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

AI summary

FlightDiffusion enables scalable, physically consistent drone navigation training by generating FPV flight videos from text prompts and extracting executable trajectories via visual odometry, achieving effective sim-to-real transfer.

Autonomous drones; Diffusion models; FPV video generation; Visual odometry; Sim-to-real transfer; Trajectory synthesis

Problem

Autonomous drone navigation policies struggle with sample inefficiency and rely on expensive, dangerous real-world data collection, while existing methods fail to tightly couple high-level semantic planning with low-level physical control.

Approach

The system conditions an image-to-video diffusion model on a single frame and task prompt to generate FPV flight sequences, then reconstructs physically consistent 3D trajectories and state-action pairs using monocular visual odometry for scalable policy training.
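The trajectory-reconstruction step can be illustrated with a minimal sketch. It assumes a visual odometry front end has already produced frame-to-frame relative transforms with metric scale (monocular odometry recovers translation only up to scale, so some scale source is assumed here). The helper names, the (x, y, z, yaw) state, and the finite-difference velocity action are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def yaw_rotation(yaw):
    """Rotation matrix for a turn of `yaw` radians about the z axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_transform(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def integrate_relative_poses(relative_poses):
    """Chain frame-to-frame transforms into world-frame poses, starting at identity."""
    pose = np.eye(4)
    poses = [pose.copy()]
    for rel in relative_poses:
        pose = pose @ rel  # relative motion is expressed in the previous body frame
        poses.append(pose.copy())
    return poses

def state_action_pairs(poses, dt):
    """Hypothetical labelling: state = (x, y, z, yaw), action = finite-difference
    linear velocity plus yaw rate needed to reach the next pose."""
    pairs = []
    for prev, nxt in zip(poses[:-1], poses[1:]):
        yaw_prev = np.arctan2(prev[1, 0], prev[0, 0])
        yaw_next = np.arctan2(nxt[1, 0], nxt[0, 0])
        # Wrap the yaw difference into (-pi, pi] before dividing by dt.
        dyaw = np.arctan2(np.sin(yaw_next - yaw_prev), np.cos(yaw_next - yaw_prev))
        state = np.append(prev[:3, 3], yaw_prev)
        action = np.append((nxt[:3, 3] - prev[:3, 3]) / dt, dyaw / dt)
        pairs.append((state, action))
    return pairs

# Illustrative input: four identical steps, each 1 m forward with a 90 degree left turn.
step = make_transform(yaw_rotation(np.pi / 2), [1.0, 0.0, 0.0])
poses = integrate_relative_poses([step] * 4)
pairs = state_action_pairs(poses, dt=0.5)
```

With this input the integrated path traces a closed 1 m square, which gives a quick sanity check that the composition order is right before feeding in real odometry output.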

Key results

Low trajectory tracking error (mean positional error 0.25 m, RMSE 0.28 m; mean orientation error 0.19 rad, RMSE 0.24 rad)
No statistically significant difference between simulated and real-world success rates (ANOVA, p = 0.541)
Scalable synthesis of diverse, action-rich FPV training datasets
Effective long-horizon skill repetition from generated video
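The abstract reports both a mean error and an RMSE for each tracking quantity. The two are different statistics, and the short sketch below shows how each is computed from per-sample error magnitudes. The sample values are hypothetical and are not data from the paper.

```python
import math

def mean_error(errors):
    """Mean of the absolute per-sample errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean square of the per-sample errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical position error magnitudes in metres (illustrative only).
errors = [0.0, 0.2, 0.4, 0.6]

mean_value = mean_error(errors)
rmse_value = rmse(errors)
```

RMSE is never smaller than the mean absolute error, and the gap widens as the errors become more spread out, which is why the two figures should not be quoted interchangeably.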

Why it matters

This approach offers a scalable, cost-effective alternative to real-world data collection for training robust autonomous UAV navigation policies, bridging the gap between high-level semantic reasoning and low-level control.

Abstract

We present FlightDiffusion, a diffusion-based framework for training autonomous drones from first-person view (FPV) video. The model generates FPV video sequences from a single frame and a text prompt, and derives corresponding state-action trajectories for task-conditioned navigation. FlightDiffusion leverages generative modeling to synthesize diverse FPV trajectories and corresponding state-action pairs, enabling scalable dataset generation without the high cost of real-world data collection. These datasets support not only the learning pipeline but also the training of autonomous navigation systems. Our evaluation shows that the generated trajectories are physically feasible and executable, with a mean positional error of 0.25 m (RMSE 0.28 m) and a mean orientation error of 0.19 rad (RMSE 0.24 rad). This approach establishes scalable dataset generation and supports reliable navigation performance. Results in simulated environments indicate stable trajectory planning and consistent behavior across varying conditions. An ANOVA revealed no statistically significant difference between performance in simulation and reality (F(1, 16) = 0.394, p = 0.541), with success rates of M = 0.628 (SD = 0.162) and M = 0.617 (SD = 0.177), respectively, indicating effective sim-to-real transfer. The generated datasets provide a useful resource for future UAV research. This work introduces diffusion-based video generation as a promising mechanism for coupling task-level reasoning with executable trajectory synthesis in aerial robotics.
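For readers unfamiliar with the F(1, 16) notation, the sketch below computes a one-way ANOVA F statistic and its degrees of freedom. It assumes a one-way between-groups design, which the abstract does not state explicitly. Under that assumption, 1 and 16 degrees of freedom correspond to two conditions and 18 observations in total. The equal 9/9 split and the success-rate values are hypothetical placeholders, not the paper's data.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way between-groups ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical success rates per trial block (illustrative only).
sim = [0.4, 0.5, 0.6, 0.7, 0.8, 0.6, 0.5, 0.7, 0.6]
real = [0.5, 0.4, 0.6, 0.8, 0.7, 0.6, 0.5, 0.6, 0.7]

f_stat, df_between, df_within = one_way_anova([sim, real])
```

A small F with a large p-value means the test found no evidence of a difference between conditions. It does not by itself demonstrate that the two conditions are equivalent.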

Index terms

Cognitive Control Architectures; Visual Tracking; Aerial Systems: Perception and Autonomy