ICRA 2026

STAF-Navi: Vision-Based Spatio-Temporal Attention Fusion Navigation Framework

Haowen Zhang, Fanghong Liu, Chaoyu Zhang, Qiuze Yu

AI summary

Integrating spatio-temporal attention with deep collision encoding significantly improves UAV navigation success and path efficiency in cluttered, unknown environments.

UAV navigation, spatio-temporal attention, reinforcement learning, deep collision encoding, sim-to-real, cluttered environments

Problem

Current deep reinforcement learning agents for UAVs lack long-term memory, causing them to struggle with partial observability, target retention, and dynamic obstacle avoidance in cluttered environments.

Approach

STAF-Navi fuses historical depth images and flight states using a Transformer-based actor and GRU-based critic, while a deep collision encoder compresses spatial data into latent obstacle representations for real-time decision-making.
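As a rough illustration of attention-based fusion over a history window, the sketch below applies scaled dot-product attention with the current step as the query and all steps in the window as keys and values. It is not the paper's actor network: it uses identity projections instead of learned ones, a single head, and random stand-in features. The function name `attend_over_history` and the dimensions `LATENT_DIM` and `STATE_DIM` are assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_history(tokens):
    """Fuse a history window with single-head scaled dot-product attention.

    tokens: list of H equal-length feature vectors, oldest first; the last
    entry is the current step and serves as the query. Identity projections
    are an assumption of this sketch (a trained model would learn them).
    Returns (fused_vector, attention_weights).
    """
    query = tokens[-1]
    dim = len(query)
    scale = math.sqrt(dim)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
    weights = softmax(scores)
    fused = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return fused, weights


# Stand-in data: each token concatenates a depth latent and a flight state.
random.seed(0)
H, LATENT_DIM, STATE_DIM = 20, 8, 4  # H from the summary; the dims are hypothetical
tokens = [
    [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM + STATE_DIM)]
    for _ in range(H)
]
fused, weights = attend_over_history(tokens)
```

The fused vector has the same length as one token, and the attention weights form a probability distribution over the H steps, so they can be inspected to see which past observations influence the current decision.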

Key results

10% increase in simulation navigation success rate
7% improvement in path efficiency
Optimal temporal window of H=20 with exponential weighting
Successful sim-to-real deployment in real-world UAV tests
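The exponential weighting over an H=20 window listed above could take a form like the following. The summary does not give the exact scheme, so the decay rate `decay`, the normalization to a unit sum, and the helper names are assumptions of this sketch.

```python
import math


def exponential_weights(horizon=20, decay=0.1):
    """Weights over a history window, oldest first, newest weighted most.

    The weight for a step that is `age` steps old is proportional to
    exp(-decay * age). The decay value is a hypothetical choice.
    """
    raw = [math.exp(-decay * age) for age in range(horizon - 1, -1, -1)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_history(features, decay=0.1):
    """Collapse a list of feature vectors (oldest first) into one vector."""
    weights = exponential_weights(len(features), decay)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


weights = exponential_weights()  # 20 weights that sum to 1
```

A larger `decay` concentrates the weight on the most recent observations, while a `decay` of 0 reduces to a uniform average over the window.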

Why it matters

Provides a robust, memory-augmented navigation solution for autonomous UAVs operating in complex, dynamic, and partially observable environments.

Abstract

In cluttered, unknown, and partially observable environments, Uncrewed Aerial Vehicle (UAV) navigation encounters formidable challenges. To address these challenges, we propose an innovative spatio-temporal attention fusion navigation framework called STAF-Navi. The framework integrates spatio-temporal attention mechanisms to model sequential dependencies. It captures spatial and temporal correlations from historical observations and actions to improve navigation and obstacle avoidance. STAF-Navi employs deep collision encoding to compress high-dimensional depth images into informative low-dimensional latent states, and a single-site Transformer to model historical sensor inputs and states, enhancing the utility of current observations. By exploiting temporal dependencies, this integration enables early braking and stable hovering. Extensive simulation experiments show that the framework increases the navigation success rate by 10% and improves path efficiency by 7%. Finally, the successful deployment of the proposed strategy in real-world scenarios validates its effectiveness.
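One way to hold the historical observations and actions described in the abstract is a fixed-length buffer of per-step tokens. The sketch below assumes each token concatenates a depth latent, a flight state, and the previous action, and that early steps are zero-padded with a validity mask. The class `HistoryBuffer`, its methods, and the token layout are hypothetical and not taken from the paper.

```python
from collections import deque


class HistoryBuffer:
    """Fixed-length window of per-step tokens, oldest first."""

    def __init__(self, horizon, token_dim):
        self.horizon = horizon
        self.token_dim = token_dim
        # deque(maxlen=...) silently drops the oldest token once full.
        self._tokens = deque(maxlen=horizon)

    def push(self, latent, state, action):
        """Append one step: depth latent + flight state + previous action."""
        token = list(latent) + list(state) + list(action)
        if len(token) != self.token_dim:
            raise ValueError(f"expected {self.token_dim} values, got {len(token)}")
        self._tokens.append(token)

    def window(self):
        """Return (rows, mask): rows are zero-padded at the front until the
        buffer fills; mask is True for rows that hold real observations."""
        missing = self.horizon - len(self._tokens)
        padding = [[0.0] * self.token_dim for _ in range(missing)]
        mask = [False] * missing + [True] * len(self._tokens)
        return padding + [list(t) for t in self._tokens], mask


# Usage: a 3-step window of 5-dimensional tokens (2 latent, 2 state, 1 action).
buf = HistoryBuffer(horizon=3, token_dim=5)
buf.push([1.0, 2.0], [3.0, 4.0], [5.0])
rows, mask = buf.window()
```

The mask lets a sequence model ignore padded steps at the start of an episode instead of treating zeros as real observations.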

Index terms

Vision-Based Navigation; Reinforcement Learning; Aerial Systems: Perception and Autonomy