Scaling Multi-Agent Reinforcement Learning for Underwater Acoustic Tracking Via Autonomous Vehicles
Matteo Gallici, Ivan Masmitja, Mario Martin
AI summary
Problem
Scaling multi-agent reinforcement learning (MARL) for underwater tracking is hindered by the sample inefficiency of MARL and the extreme computational cost of high-fidelity simulators, while existing methods fail to generalize across varying fleet sizes and numbers of targets.
Approach
The authors introduce a GPU-vectorized simplified environment for rapid training alongside a high-fidelity simulator for evaluation, paired with a Transformer-based architecture that learns policies invariant to the number of agents and targets via curriculum learning.
Key results
- Achieves up to 30,000× training speedup over Gazebo
- Policies remain invariant to fleet size and target count
- Tracks 5 fast-moving targets with only 5 vehicles
- Maintains tracking errors below 5 m in high-fidelity simulation
Why it matters
Provides a scalable, computationally efficient framework for training cooperative autonomous vehicle fleets, accelerating progress in marine monitoring and underwater research.
Abstract
Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essential for multi-target tracking or rapidly moving targets) is challenging. Multi-Agent RL (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo’s LRAUV provide up to 100× faster-than-real-time single-robot simulations, they offer little speedup in multi-vehicle scenarios, making MARL training impractical. Yet, high-fidelity simulation is crucial to test complex policies and close the sim-to-real gap. To address these limitations, we develop a GPU-accelerated environment that achieves up to 30,000× speedup over Gazebo while preserving its dynamics. This enables fast, end-to-end GPU training and seamless transfer to Gazebo for evaluation. We also introduce a Transformer-based architecture (TransfMAPPO) that learns policies invariant to fleet size and number of targets, enabling curriculum learning to train larger fleets on increasingly complex scenarios. After large-scale GPU training, we perform extensive evaluations in Gazebo, showing our method maintains tracking errors below 5 m even with multiple fast-moving targets.
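To illustrate why an attention-based policy can be invariant to fleet size and number of targets, the following minimal sketch shows single-head attention pooling over a variable-length set of entity features. It is not the TransfMAPPO architecture, whose details are not given here; the function names, feature sizes, and random weights are hypothetical and chosen only for illustration. The point is that the same fixed weight matrices process 2 or 5 entities and always return a vector of the same length, regardless of entity order.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, entities, Wk, Wv):
    """Attend from one query vector over any number of entity feature vectors.

    The weights Wk and Wv do not depend on len(entities), so the same
    parameters apply to any fleet size or target count (illustrative only).
    """
    keys = [matvec(Wk, e) for e in entities]
    values = [matvec(Wv, e) for e in entities]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    # Weighted sum of values: a fixed-size summary of a variable-size set.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def make_matrix(rng, rows, cols):
    """Random matrix standing in for learned weights (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
FEAT, HID = 4, 3  # hypothetical feature and hidden sizes
Wk = make_matrix(rng, HID, FEAT)
Wv = make_matrix(rng, HID, FEAT)
query = [rng.uniform(-1, 1) for _ in range(HID)]

two_entities = [[rng.uniform(-1, 1) for _ in range(FEAT)] for _ in range(2)]
five_entities = [[rng.uniform(-1, 1) for _ in range(FEAT)] for _ in range(5)]

out_two = attention_pool(query, two_entities, Wk, Wv)
out_five = attention_pool(query, five_entities, Wk, Wv)
# Reordering the entities should leave the pooled output unchanged.
out_five_reversed = attention_pool(query, list(reversed(five_entities)), Wk, Wv)
```

Because the parameter shapes never depend on the number of entities, a policy built from such blocks could in principle be trained on small scenarios and reused on larger ones, which is the property that makes a curriculum over fleet size possible.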