Transformer-Based Hierarchical Reinforcement Learning for Sequential Decision-Making in Swarm Confrontation
Ruozhai Sun, Qizhen Wu, Lei Chen
AI summary
Problem
High-level policies in hierarchical reinforcement learning struggle to reason over dynamic, variable-sized observations of other agents, limiting strategic capabilities in long-horizon swarm confrontation tasks.
Approach
We propose a decentralized two-level framework. A Transformer-based high-level policy uses self-attention to reason over variable-sized entity sets and allocate tasks strategically; a low-level motion controller then executes each allocated task using task-aware potential fields.
Key results
- Achieves win rates of up to 81% against rule-based baselines in complex swarm confrontations
- Demonstrates strong zero-shot generalization to larger, unseen swarm scales without retraining
- Enables interpretable decision-making through Transformer attention visualization
- Fosters autonomous emergence of sophisticated cooperative tactics
Why it matters
Provides a scalable, interpretable blueprint for training strategically sophisticated multi-agent systems in complex, dynamic environments.
Abstract
Hierarchical Reinforcement Learning (HRL) is a potent paradigm for addressing long-horizon sequential decision-making in swarm confrontation. However, its strategic capabilities are often bottlenecked by high-level policies that struggle to reason over the dynamic, variable-sized observations of other agents. To address this, we introduce a novel decentralized HRL framework featuring a Transformer-based strategic policy. The Transformer’s self-attention mechanism is uniquely suited to capture complex spatio-temporal relationships among a varying number of entities, enabling robust long-horizon task allocation. This high-level strategy is then translated by a low-level policy into collision-free navigation. In complex swarm confrontation scenarios, our method significantly outperforms established baselines, achieving win rates of up to 81%. Beyond this performance, the learned policies exhibit strong zero-shot generalization to larger swarms, offer decision-making interpretability via the attention mechanism, and foster the autonomous emergence of complex cooperative tactics. This work provides a blueprint for scalable, strategically sophisticated, and interpretable multi-agent systems.
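To illustrate why self-attention suits variable-sized observations, the following is a minimal sketch and not the architecture used in this work: a single attention head with random, untrained weights, followed by mean pooling. All names (`self_attention`, `encode_observation`, `D_IN`, `D_MODEL`) and the three-dimensional entity features are assumptions made for illustration. The sketch shows that the same weight matrices map any number of observed entities to a fixed-size embedding, and that the attention weights are available for inspection.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [dot(row, x) for row in W]


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    # Hypothetical stand-in for learned projection weights.
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def self_attention(entities, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a set of entities."""
    queries = [matvec(Wq, e) for e in entities]
    keys = [matvec(Wk, e) for e in entities]
    values = [matvec(Wv, e) for e in entities]
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        weights.append(w)  # one row of attention weights per entity
        outputs.append(
            [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        )
    return outputs, weights


def encode_observation(entities, Wq, Wk, Wv):
    """Mean-pool the attended entities into one fixed-size embedding."""
    outputs, weights = self_attention(entities, Wq, Wk, Wv)
    n = len(outputs)
    pooled = [sum(row[i] for row in outputs) / n for i in range(len(outputs[0]))]
    return pooled, weights


rng = random.Random(0)
D_IN, D_MODEL = 3, 4  # assumed feature and embedding sizes
Wq = random_matrix(D_MODEL, D_IN, rng)
Wk = random_matrix(D_MODEL, D_IN, rng)
Wv = random_matrix(D_MODEL, D_IN, rng)

# Two observations with different numbers of entities, same weights.
small_swarm = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(3)]
large_swarm = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(7)]

emb_small, attn_small = encode_observation(small_swarm, Wq, Wk, Wv)
emb_large, attn_large = encode_observation(large_swarm, Wq, Wk, Wv)
print(len(emb_small), len(emb_large))
```

Both embeddings have length `D_MODEL` even though the two observations contain different numbers of entities, which is the property that lets one policy be applied to swarm sizes not seen in training. Because the sketch uses no positional encoding, the pooled embedding also does not depend on the order in which entities are listed. Each row of `attn_small` or `attn_large` sums to one and could be plotted to inspect which entities a given entity attends to.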