ICRA 2026

M2GRPO: Mamba-Based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Yukai Feng, Zhiheng Wu, Zhengxing Wu, Junwen Gu, Junzhi Yu, Min Tan

AI summary

M2GRPO enables stable, efficient cooperative pursuit for biomimetic underwater robots by combining Mamba-based temporal modeling with group-relative policy optimization.

Mamba, Multi-Agent Reinforcement Learning, Underwater Robots, Cooperative Pursuit, Group-Relative Policy Optimization, Biomimetic Robotics

Problem

Traditional multi-agent reinforcement learning struggles with long-horizon decision-making, partial observability, and inter-robot coordination in biomimetic underwater pursuit tasks, often relying on memoryless MLPs that lack temporal and relational expressiveness.

Approach

The framework replaces standard policy networks with a Mamba-based architecture that captures long-term dependencies and agent interactions. It also introduces MAGRPO, a multi-agent extension of GRPO that normalizes advantages across parallel environments and removes the need for an explicit value network.
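As a rough illustration of the value-free idea, the sketch below computes group-relative advantages in the GRPO style: each return in a group is standardized against the group's own mean and standard deviation, so no learned value network is needed as a baseline. The function name `group_relative_advantages`, the `eps` constant, and the choice of what forms a "group" (here, scalar returns from parallel rollouts) are assumptions for illustration, not details confirmed by the paper.

```python
from statistics import fmean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Standardize a group of scalar returns (hypothetical helper).

    The group mean acts as the baseline that a critic would otherwise
    provide; eps guards against division by zero when all returns match.
    """
    mean = fmean(returns)
    std = pstdev(returns)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in returns]


# Example: four rollouts of the same task with different returns.
group_returns = [1.0, 2.0, 3.0, 4.0]
advantages = group_relative_advantages(group_returns)
print(advantages)  # below-average returns get negative advantages
```

Because the baseline comes from the group itself, the advantages sum to approximately zero within each group, which is one reason this style of update can remain stable without a critic.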

Key results

Mamba policy efficiently captures long-range temporal dependencies and dynamic inter-agent relations
MAGRPO eliminates explicit value networks while maintaining training stability and reducing computational overhead
Consistently outperforms MAPPO and recurrent baselines in pursuit success rate and capture efficiency
Successfully validated through both extensive simulations and real-world pool experiments on biomimetic robot sharks

Why it matters

Offers a scalable, resource-efficient solution for cooperative underwater multi-robot systems, advancing practical deployment of intelligent marine swarms.

Abstract

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M2GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M2GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.
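The abstract does not spell out what "normalized Gaussian sampling" involves. One common way to obtain bounded continuous actions from a Gaussian policy is to squash the raw sample with tanh and rescale it to the actuator range; the sketch below assumes that approach, which may differ from the paper's. The function name `sample_bounded_action` and the parameters `mean`, `log_std`, `low`, and `high` are hypothetical.

```python
import math
import random


def sample_bounded_action(mean, log_std, low, high, rng):
    """Draw one bounded action via tanh squashing (assumed technique).

    mean, log_std: Gaussian parameters a policy head might output.
    low, high: actuator limits for this action dimension.
    """
    std = math.exp(log_std)          # keep the standard deviation positive
    raw = rng.gauss(mean, std)       # unbounded Gaussian sample
    squashed = math.tanh(raw)        # mapped into [-1, 1]
    # Affine rescale from [-1, 1] to [low, high].
    return low + 0.5 * (squashed + 1.0) * (high - low)


rng = random.Random(0)  # seeded for repeatability
action = sample_bounded_action(mean=0.0, log_std=-0.5, low=-1.0, high=1.0, rng=rng)
print(action)
```

Squashing keeps every sampled command inside the actuator limits regardless of how large the raw Gaussian sample is, at the cost of needing a change-of-variables correction if log-probabilities of the squashed action are used in the policy update.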

Index terms

Biologically-Inspired Robots, Reinforcement Learning, Cooperating Robots