Self-Organised Sequential Multi-Agent Reinforcement Learning for Closely Cooperation Tasks
Hao Fu, YOU MINGYU, Zhou Hongjun, Bin He
AI summary
Problem
Standard multi-agent reinforcement learning struggles with closely cooperative tasks where simultaneous actions are required, as individual optimal policies often conflict with group optima and trap agents in local Nash equilibria.
Approach
The method converts parallel agent decisions into sequential, autoregressive action selection within automatically formed groups, using recursive reward decomposition to align individual policies with global objectives.
Key results
- Sequential decision-making framework that aligns individual and group optima
- Automatic grouping mechanism based on state-action coupling scores
- 36% average improvement in task completion rate over state-of-the-art MARL algorithms
- Successful deployment and validation in both simulated and real-world box-pushing environments
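To make the grouping idea concrete, here is a minimal sketch of one way agents could be partitioned from pairwise coupling scores. The summary does not say how the scores are computed or how groups are formed from them, so everything here is an assumption made for illustration: the `coupling` matrix, the `threshold` parameter, and the choice of connected components are all hypothetical and should not be read as the authors' mechanism.

```python
def group_agents(coupling, threshold):
    """Partition agents into groups of strongly coupled agents.

    coupling  : symmetric n x n list of lists; coupling[i][j] is a
                hypothetical coupling score between agents i and j.
    threshold : hypothetical cut-off; pairs scoring at or above it are
                placed in the same group.
    Returns a sorted list of groups, each a list of agent indices.
    """
    n = len(coupling)
    parent = list(range(n))  # union-find forest, one tree per group

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair whose score reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if coupling[i][j] >= threshold:
                parent[find(i)] = find(j)

    # Collect agents by the root of their tree.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Illustrative scores only: agents 0 and 1 are coupled, as are agents 2 and 3.
example_coupling = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.0],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
]
example_groups = group_agents(example_coupling, threshold=0.5)
```

Under this assumption, sequential decisions would only need to be ordered inside each group, which keeps the length of any decision chain bounded by the group size rather than by the total number of agents.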
Why it matters
Enables scalable and reliable coordination for multi-robot systems in tasks requiring precise simultaneous action, advancing real-world robotic deployment.
Abstract
Cooperative tasks are common in multi-agent systems. Closely cooperative tasks are a special case in which a change in the state of the environment requires multiple agents to perform a specific operation at the same time. Take a box-pushing task as an example: the box is heavy and moves only when multiple agents push it simultaneously. In a closely cooperative task, each agent's optimal action depends on the actions of the other agents, so the individually optimal action may be inconsistent with the group-optimal action. This gives the problem more Nash equilibrium policies that are not globally optimal, and makes it easier for a policy learned by reinforcement learning to become trapped in one of these locally optimal policies. In this paper, we propose a self-organised sequential multi-agent reinforcement learning algorithm (SOS-MARL). We propose sequential decision-making, which changes the optimization objective of each agent's policy so that the learned policy tends toward the group-optimal policy. We also propose an automatic grouping mechanism that smooths training and inference in environments with a large number of agents. We factorize the joint action value across groups into a combination of per-group action values, which guides the agents to improve their group policies in a fine-grained manner. We deployed scenarios in both simulated and real environments and compared SOS-MARL with various classical MARL algorithms on box-pushing tasks, demonstrating the state-of-the-art performance of our method.
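The argument about non-globally-optimal Nash equilibria can be illustrated with a toy two-agent box-pushing game. The sketch below is not the SOS-MARL algorithm: the reward values, the action names, and the helper functions are assumptions chosen only to show why simultaneous decisions admit a poor equilibrium while an ordered, conditioned choice avoids it.

```python
from itertools import product

ACTIONS = ("stay", "push")


def team_reward(a1, a2):
    """Shared reward for a heavy box (illustrative values, not from the paper).

    The box moves only if both agents push; a lone push wastes effort.
    """
    pushes = (a1 == "push") + (a2 == "push")
    if pushes == 2:
        return 1.0   # box moves
    if pushes == 1:
        return -0.1  # effort spent, box does not move
    return 0.0       # nobody acts


def pure_nash_equilibria():
    """Joint actions from which no single agent gains by deviating alone."""
    equilibria = []
    for a1, a2 in product(ACTIONS, repeat=2):
        r = team_reward(a1, a2)
        agent1_ok = all(team_reward(d, a2) <= r for d in ACTIONS)
        agent2_ok = all(team_reward(a1, d) <= r for d in ACTIONS)
        if agent1_ok and agent2_ok:
            equilibria.append((a1, a2))
    return equilibria


def sequential_choice():
    """Agents decide in order; the second conditions on the first's action."""
    # Agent 1 scores each action by the best reward agent 2 can still reach.
    a1 = max(ACTIONS, key=lambda a: max(team_reward(a, b) for b in ACTIONS))
    # Agent 2 sees a1 and picks its best response to it.
    a2 = max(ACTIONS, key=lambda b: team_reward(a1, b))
    return a1, a2


equilibria = pure_nash_equilibria()
ordered = sequential_choice()
```

Here `equilibria` contains both `("stay", "stay")` and `("push", "push")`: from the first, a lone agent that starts pushing is penalised, so independent learners can settle there even though it is not the group optimum. The ordered choice in `sequential_choice` returns `("push", "push")` because the first agent evaluates its action under the assumption that the second will respond to it.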