Learning Cooperative Strategies for Drone Swarms Using Multi-Agent Reinforcement Learning
Christian Llanes, Kyle Williams, Spencer Jensen, Samuel Coogan
AI summary
Problem
When pursuers have superior speed or control authority, individual evader drones struggle to reach targets without being captured. The paper addresses the gap in scalable, cooperative strategies for asymmetric multi-agent pursuit-evasion scenarios.
Approach
The authors train evader drone teams using Multi-Agent Proximal Policy Optimization (MAPPO) to learn coordinated maneuvers that intentionally guide superior pursuers into mutual collisions.
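As a rough illustration of the optimization step behind MAPPO, the sketch below computes the standard PPO clipped surrogate objective for one agent's batch of samples. In MAPPO the advantages are typically estimated with a centralized value function while each agent keeps its own decentralized policy. The function name, the argument names, and the clipping value of 0.2 are assumptions for illustration and are not taken from the paper.

```python
import math


def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current and the data-collecting policy. advantages: advantage estimates,
    which in MAPPO would usually come from a centralized critic.
    clip_eps is an assumed, commonly used default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the policy has not changed, the ratio is 1 and the objective reduces to the mean advantage. Clipping only takes effect once the new policy moves more than `clip_eps` away from the old one in the direction the advantage favors.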
Key results
- Developed a 6-DOF MAPPO algorithm for multi-agent pursuit-evasion
- Proposed an augmented proportional navigation defense strategy for pursuers
- Validated algorithm adaptability across 2v2 and 4v4 team configurations
- Demonstrated successful sim-to-real transfer on Crazyflie hardware under real-world constraints
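For readers unfamiliar with proportional navigation, the sketch below shows the classical textbook guidance law in 2D, where the commanded acceleration is proportional to the closing speed times the line-of-sight rotation rate. It is not the augmented variant proposed in the paper, whose details are not given in this summary. The function name and the navigation gain of 3.0 are assumptions.

```python
import math


def pn_accel_2d(p_pos, p_vel, t_pos, t_vel, nav_gain=3.0):
    """Classical (true) proportional navigation in 2D.

    Returns the pursuer's commanded acceleration, applied perpendicular to
    the line of sight (LOS). nav_gain is an assumed, typical value.
    """
    rx, ry = t_pos[0] - p_pos[0], t_pos[1] - p_pos[1]  # relative position
    vx, vy = t_vel[0] - p_vel[0], t_vel[1] - p_vel[1]  # relative velocity
    r2 = rx * rx + ry * ry
    if r2 == 0.0:
        return (0.0, 0.0)  # already at the target; no meaningful LOS
    r = math.sqrt(r2)
    los_rate = (rx * vy - ry * vx) / r2      # LOS angular rate
    closing_speed = -(rx * vx + ry * vy) / r  # positive when range shrinks
    a_mag = nav_gain * closing_speed * los_rate
    nx, ny = -ry / r, rx / r                  # unit normal to the LOS
    return (a_mag * nx, a_mag * ny)
```

On a perfect collision course the line of sight does not rotate, so the commanded acceleration is zero. Any rotation of the line of sight produces a corrective acceleration toward the target's direction of travel.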
Why it matters
The work provides a scalable framework that lets less capable drone swarms defeat superior opponents through learned coordination, advancing robust multi-agent autonomy for defense and search missions.
Abstract
In this work, we investigate cooperative strategies for evader drone teams of various sizes using multi-agent reinforcement learning in a multi-agent pursuit-evasion scenario. The objective of the evader team is to reach a goal with minimal velocity while not colliding with the pursuer team. The objective of the pursuer team is to defend the goal by catching evaders before they reach it. In this environment, we allow the pursuer to have superior control authority compared to the evader such that reaching the goal is challenging for the evader in a one-on-one scenario. The proposed strategy for an evader is to team up with an ally to lead pursuers into a collision with each other instead of intercepting the evader. We design policies using multi-agent proximal policy optimization, an actor-critic reinforcement learning method, and investigate how the learned strategy changes when we vary the size of the pursuer and evader teams. Finally, we demonstrate the learned policy’s sim-to-real capabilities through a hardware demonstration.
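The three kinds of events described in the abstract (an evader reaching the goal, a pursuer catching an evader, and pursuers colliding with each other) can be sketched as simple distance checks. The function name, the radii, and the return format below are hypothetical; the abstract does not give the paper's actual thresholds, the velocity condition at the goal, or how ties are resolved, so none of those are modeled here.

```python
import math


def classify_events(evaders, pursuers, goal,
                    goal_radius=0.2, capture_radius=0.1, collision_radius=0.1):
    """Report which agents trigger each event at one time step.

    evaders / pursuers: lists of (x, y, z) positions. goal: (x, y, z).
    All radii are assumed values chosen only for illustration.
    """
    # Evaders close enough to the goal (velocity condition not modeled).
    reached = [i for i, e in enumerate(evaders)
               if math.dist(e, goal) <= goal_radius]
    # Evaders within capture range of any pursuer.
    captured = [i for i, e in enumerate(evaders)
                if any(math.dist(e, p) <= capture_radius for p in pursuers)]
    # Pursuers that are within collision range of another pursuer.
    collided = set()
    for i in range(len(pursuers)):
        for j in range(i + 1, len(pursuers)):
            if math.dist(pursuers[i], pursuers[j]) <= collision_radius:
                collided.update((i, j))
    return {"reached_goal": reached,
            "captured": captured,
            "pursuers_collided": sorted(collided)}
```

A reward function for the evader team could then be built on top of these events, for example by rewarding goal arrivals and pursuer collisions and penalizing captures, though the abstract does not specify the reward actually used.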