ICRA 2026

Transformer-Based Hierarchical Reinforcement Learning for Sequential Decision-Making in Swarm Confrontation

Ruozhai Sun, Qizhen Wu, Lei Chen

AI summary

A Transformer-based hierarchical reinforcement learning framework achieves win rates of up to 81% in swarm confrontation while enabling zero-shot generalization to larger swarms, interpretable decision-making, and emergent cooperative tactics.

Hierarchical Reinforcement Learning, Transformer, Swarm Confrontation, Multi-Agent Systems, Zero-Shot Generalization, Strategic Planning

Problem

High-level policies in hierarchical reinforcement learning struggle to reason over dynamic, variable-sized observations of other agents, limiting strategic capabilities in long-horizon swarm confrontation tasks.

Approach

We propose a decentralized two-level framework where a Transformer-based high-level policy uses self-attention to reason over variable-sized entity sets for strategic task allocation, which is then executed by a low-level motion controller via task-aware potential fields.
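The two-level idea can be illustrated with a minimal sketch. This is not the paper's implementation: the names `attention_weights`, `select_target`, and `potential_field_step` are hypothetical, the attention scores come from a hand-set query and distance-based keys rather than learned Transformer projections, and the potential field is a generic attractive/repulsive field with assumed gains, not the task-aware variant the authors describe. It only shows why attention handles entity sets of any size and how a selected target could drive a low-level motion step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over any number of keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)


def select_target(own_pos, enemy_positions):
    """Toy high-level policy: attend over all observed enemies, pick the top one.

    Assumption: each key is [-distance], so nearer enemies score higher.
    A learned policy would use trained projections of richer features.
    """
    keys = [[-math.dist(own_pos, e)] for e in enemy_positions]
    weights = attention_weights([1.0], keys)
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights


def potential_field_step(pos, target, obstacles,
                         step=0.1, k_att=1.0, k_rep=0.5, radius=1.0):
    """Toy low-level controller: one fixed-length step along the field direction.

    Attraction pulls toward the target; each obstacle closer than `radius`
    pushes away. All gains are illustrative assumptions.
    """
    fx = k_att * (target[0] - pos[0])
    fy = k_att * (target[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        dist = math.hypot(dx, dy)
        if 0.0 < dist < radius:
            mag = k_rep * (1.0 / dist - 1.0 / radius) / (dist * dist)
            fx += mag * dx / dist
            fy += mag * dy / dist
    norm = math.hypot(fx, fy)
    if norm == 0.0:
        return pos
    return (pos[0] + step * fx / norm, pos[1] + step * fy / norm)


if __name__ == "__main__":
    me = (0.0, 0.0)
    enemies = [(5.0, 0.0), (1.0, 0.0), (3.0, 3.0)]
    idx, weights = select_target(me, enemies)
    print("target index:", idx, "weights:", weights)
    print("next position:", potential_field_step(me, enemies[idx], []))
```

The same `select_target` call works unchanged whether three or thirty enemies are observed, which is the property that makes attention a natural fit for variable-sized entity sets. The weights it returns are also what one would inspect when visualizing which entities drove a decision.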

Key results

Achieves up to 81% win rates against rule-based baselines in complex swarm confrontations
Demonstrates strong zero-shot generalization to larger, unseen swarm scales without retraining
Enables interpretable decision-making through Transformer attention visualization
Fosters autonomous emergence of sophisticated cooperative tactics

Why it matters

Provides a scalable, interpretable blueprint for training strategically sophisticated multi-agent systems in complex, dynamic environments.

Abstract

Hierarchical Reinforcement Learning (HRL) is a potent paradigm for addressing long-horizon sequential decision-making in swarm confrontation. However, its strategic capabilities are often bottlenecked by high-level policies that struggle to reason over the dynamic, variable-sized observations of other agents. To address this, we introduce a novel decentralized HRL framework featuring a Transformer-based strategic policy. The Transformer’s self-attention mechanism is uniquely suited to capture complex spatio-temporal relationships among a varying number of entities, enabling robust long-horizon task allocation. This high-level strategy is then translated by a low-level policy into collision-free navigation. In complex swarm confrontation scenarios, our method significantly outperforms established baselines, achieving win rates of up to 81%. Beyond this performance, the learned policies exhibit strong zero-shot generalization to larger swarms, offer decision-making interpretability via the attention mechanism, and foster the autonomous emergence of complex cooperative tactics. This work provides a blueprint for scalable, strategically sophisticated, and interpretable multi-agent systems.

Index terms

Reinforcement Learning, Multi-Robot Systems, Task and Motion Planning