In cooperative multi-agent reinforcement learning (MARL), policy gradient (PG) methods are generally considered less sample efficient than value decomposition (VD) methods because PG is *on-policy* while VD is *off-policy*. However, some recent empirical studies demonstrate that with appropriate input representation and hyper-parameter tuning, multi-agent PG can achieve surprisingly strong performance compared to off-policy VD methods.

**Why might PG methods work so well?** In this post, we present a concrete analysis showing that in certain scenarios, for example in environments with a highly multi-modal reward landscape, VD can be problematic and lead to undesirable outcomes. In contrast, PG methods with individual policies can converge to an optimal policy in these cases. Additionally, PG methods with auto-regressive (AR) policies can learn multi-modal policies.

*Figure 1: Different policy representations for the 4-player permutation game.*

## CTDE in Cooperative MARL: VD and PG methods

Centralized training and decentralized execution (CTDE) is a popular framework in cooperative MARL. It leverages *global* information for more effective training while keeping individual policy representations for execution. CTDE can be implemented via value decomposition (VD) or policy gradient (PG), leading to two different types of algorithms.

VD methods learn local Q-networks and a mixing function that combines the local Q-networks into a global Q-function. The mixing function is usually constrained to satisfy the Individual-Global-Max (IGM) principle, which guarantees that the optimal joint action can be computed by greedily choosing the optimal action locally for each agent.
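To make the IGM principle concrete, here is a minimal sketch that assumes the simplest mixing function, a plain sum of local Q-values (the VDN-style choice). The local Q-values and the helper names `greedy` and `q_tot_additive` are hypothetical and chosen only for illustration.

```python
from itertools import product

def greedy(q):
    """Index of the largest entry in a list of local Q-values."""
    return max(range(len(q)), key=lambda a: q[a])

def q_tot_additive(local_qs, joint_action):
    """Additive (VDN-style) mixing: Q_tot is the sum of the local Q-values."""
    return sum(q[a] for q, a in zip(local_qs, joint_action))

# Hypothetical local Q-values for two agents with three actions each.
local_qs = [[0.2, 1.5, -0.3], [0.9, 0.1, 0.4]]

# Decentralized choice: each agent maximizes its own local Q-function.
decentralized = tuple(greedy(q) for q in local_qs)

# Centralized choice: brute-force argmax of Q_tot over all joint actions.
joint_actions = product(*(range(len(q)) for q in local_qs))
centralized = max(joint_actions, key=lambda a: q_tot_additive(local_qs, a))

print(decentralized, centralized)
```

With additive mixing the two choices coincide, which is exactly what IGM asks for; other mixing functions need their own constraints to keep this property.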

In contrast, PG methods directly apply policy gradient to learn an individual policy and a centralized value function for each agent. The value function takes as input the global state (e.g., MAPPO) or the concatenation of all local observations (e.g., MADDPG) to produce an accurate estimate of the global value.

## The permutation game: a simple counterexample where VD fails

We begin our analysis by considering a stateless cooperative game, namely the permutation game. In an $N$-player permutation game, each agent can choose among $N$ actions $\{1,\ldots,N\}$. Agents receive a reward of $+1$ if their actions are mutually different, that is, if the joint action is a permutation of $1,\ldots,N$; otherwise, they receive a reward of $0$. Note that there are $N!$ symmetric optimal strategies in this game.
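The reward rule above can be sketched in a few lines. The function names `permutation_reward` and `optimal_joint_actions` are hypothetical helpers, and the brute-force enumeration is only practical for small $N$.

```python
from itertools import product

def permutation_reward(joint_action):
    """Return 1 if all agents picked mutually different actions, else 0."""
    return 1 if len(set(joint_action)) == len(joint_action) else 0

def optimal_joint_actions(n):
    """Enumerate every joint action of the n-player game that earns reward 1."""
    actions = range(1, n + 1)
    return [a for a in product(actions, repeat=n) if permutation_reward(a) == 1]

# Count the optimal joint actions of the 4-player game.
print(len(optimal_joint_actions(4)))
```

The count equals $N!$, one optimal joint action per permutation.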

*Figure 2: The 4-player permutation game.*

*Figure 3: High-level intuition about why VD fails in the 2-player permutation game.*

Now let’s focus on the 2-player permutation game and apply VD to the game. In this stateless setting, we use $Q_1$ and $Q_2$ to denote local Q functions, and use $Q_\textrm{tot}$ to denote the global Q function. The IGM principle requires that

\(\arg\max_{a^1,a^2}Q_\textrm{tot}(a^1,a^2)=\{\arg\max_{a^1}Q_1(a^1),\arg\max_{a^2}Q_2(a^2)\}.\)

We prove by contradiction that VD cannot represent the payoff of the 2-player permutation game. If VD methods were able to represent the payoff, we would have

\(Q_\textrm{tot}(1, 2)=Q_\textrm{tot}(2,1)=1\quad \text{and}\quad Q_\textrm{tot}(1, 1)=Q_\textrm{tot}(2,2)=0.\)

If either of the two agents has different local Q values (for example $Q_1(1)> Q_1(2)$), we have $\arg\max_{a^1}Q_1(a^1)=1$. Then, according to the IGM principle, *any* optimal joint action

\((a^{1\star},a^{2\star})=\arg\max_{a^1,a^2}Q_\textrm{tot}(a^1,a^2)=\{\arg\max_{a^1}Q_1(a^1),\arg\max_{a^2}Q_2(a^2)\}\)

satisfies $a^{1\star}=1$ and $a^{1\star}\neq 2$, so the joint action $(a^1,a^2)=(2,1)$ is sub-optimal, that is, $Q_\textrm{tot}(2,1)<1$.

Otherwise, if $Q_1(1)=Q_1(2)$ and $Q_2(1)=Q_2(2)$, then

\(Q_\textrm{tot}(1, 1)=Q_\textrm{tot}(2,2)=Q_\textrm{tot}(1, 2)=Q_\textrm{tot}(2,1). \)

In both cases we contradict the required payoff. Therefore, value decomposition cannot represent the payoff matrix of the 2-player permutation game.
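As a worked special case, assume the mixing function is a plain sum, $Q_\textrm{tot}(a^1,a^2)=Q_1(a^1)+Q_2(a^2)$, as in VDN. This additive form is an assumption made here for illustration and is narrower than the general IGM argument above. Matching the payoff would then require

\(Q_1(1)+Q_2(2)=1,\quad Q_1(2)+Q_2(1)=1,\quad Q_1(1)+Q_2(1)=0,\quad Q_1(2)+Q_2(2)=0.\)

Adding the first two equations gives $Q_1(1)+Q_1(2)+Q_2(1)+Q_2(2)=2$, while adding the last two gives the same sum equal to $0$. The two requirements are incompatible, so no local Q-values reproduce the payoff under additive mixing.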

What about PG methods? Individual policies can indeed represent an optimal policy for the permutation game. Moreover, stochastic gradient descent can guarantee that PG converges to one of these optima under mild assumptions. This suggests that, even though PG methods are less popular in MARL than VD methods, they can be preferable in certain cases that are common in real-world applications, for example games with multiple strategy modalities.

We also note that in the permutation game, representing an optimal joint policy requires each agent to choose a distinct action. **Therefore, a successful PG implementation must ensure that policies are agent-specific.** This can be done using either individual policies with unshared parameters (referred to as PG-Ind in our paper), or an agent-ID conditioned policy (PG-ID).
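The following sketch illustrates how individual, unshared softmax policies can settle on one optimal mode. It is a simplification under stated assumptions: it performs exact gradient ascent on the expected reward (computed by enumerating permutations) instead of sampled stochastic policy gradient, and the initial logits, learning rate, and step count are hypothetical values chosen for illustration.

```python
import math
from itertools import permutations

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(policies):
    """Probability that independently sampled actions form a permutation."""
    n = len(policies)
    return sum(
        math.prod(policies[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )

def exact_pg_step(logits, lr):
    """One simultaneous step of exact gradient ascent on the expected reward J."""
    n = len(logits)
    policies = [softmax(row) for row in logits]
    new_logits = []
    for i in range(n):
        # g[a] = dJ/dpi_i(a): chance the other agents complete a permutation
        # when agent i plays action a.
        g = [0.0] * n
        for sigma in permutations(range(n)):
            g[sigma[i]] += math.prod(
                policies[j][sigma[j]] for j in range(n) if j != i
            )
        baseline = sum(p * x for p, x in zip(policies[i], g))
        # Softmax chain rule: dJ/dtheta_i[b] = pi_i(b) * (g[b] - baseline).
        new_logits.append([
            theta + lr * p * (gb - baseline)
            for theta, p, gb in zip(logits[i], policies[i], g)
        ])
    return new_logits

def train(logits, steps=2000, lr=1.0):
    for _ in range(steps):
        logits = exact_pg_step(logits, lr)
    return [softmax(row) for row in logits]

# Slightly asymmetric start: exactly uniform logits would give zero gradient.
init = [[0.1, 0.0], [0.0, 0.1]]
policies = train(init)
print(policies, expected_reward(policies))
```

Because each agent owns its own logits, the two agents can commit to different actions; which mode is reached depends on the starting point.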

## PG outperforms existing VD methods on popular MARL benchmarks

Beyond the simple illustrative example of the permutation game, we extend our study to popular and more realistic MARL benchmarks. In addition to the StarCraft Multi-Agent Challenge (SMAC), where the effectiveness of PG and agent-conditioned policy input has been verified, we show new results on Google Research Football (GRF) and the multi-player Hanabi Challenge.

*Figure 4: (left) Success rates of PG methods on GRF; (right) best and average evaluation scores on Hanabi-Full.*

In GRF, PG methods outperform the state-of-the-art VD baseline (CDS) in 5 scenarios. Interestingly, we also notice that individual policies (PG-Ind) without parameter sharing achieve comparable, sometimes even higher, success rates compared to agent-specific policies (PG-ID) in these 5 scenarios. We evaluate PG-ID in the full-scale Hanabi game with varying numbers of players (2-5 players) and compare it to SAD, a strong off-policy Q-learning variant in Hanabi, and to Value Decomposition Networks (VDN). As demonstrated in the table above, PG-ID produces results comparable to or better than the best and average rewards achieved by SAD and VDN across the different numbers of players, using the same number of environment steps.

## Beyond Higher Rewards: Learning Multimodal Behavior via Autoregressive Policy Modeling

In addition to learning higher rewards, we also study how to learn multi-modal policies in cooperative MARL. Let’s go back to the permutation game. Although we have proven that PG can effectively learn an optimal policy, the policy mode it finally reaches can depend heavily on the policy initialization. Thus, a natural question is:

*Can we learn a single policy that can cover all optimal modes?*

In the decentralized PG formulation, the factorized representation of a joint policy can only represent one particular mode. Therefore, we propose an enhanced way to parameterize policies for stronger expressiveness: auto-regressive (AR) policies.

*Figure 5: Comparison between individual policies (PG) and auto-regressive policies (AR) in the 4-player permutation game.*

Formally, we factorize the joint policy of $n$ agents into the form

\(\pi(\mathbf{a} \mid \mathbf{o}) \approx \prod_{i=1}^n \pi_{\theta^{i}} \left( a^{i}\mid o^{i},a^{1},\ldots,a^{i-1} \right),\)

where the action produced by agent $i$ depends on its own observation $o^{i}$ and on all the actions of the previous agents $1,\dots,i-1$. The auto-regressive factorization can represent *any* joint policy in a centralized MDP. The *only* modification to each agent’s policy is the input dimension, which is slightly enlarged by including previous actions; the output dimension of each agent’s policy remains unchanged.

With such minimal parameterization overhead, the AR policy substantially improves the representation power of PG methods. We note that PG with an AR policy (PG-AR) can simultaneously represent all optimal policy modes in the permutation game.
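The claim that an auto-regressive factorization can cover every optimal mode can be checked with a small sketch. The conditional policy below is hand-written rather than learned, which is an assumption made purely for illustration: each agent picks uniformly among the actions not yet taken by earlier agents. All function names are hypothetical.

```python
from itertools import product
from math import factorial

def ar_conditional(prefix, n):
    """Hand-written conditional for one agent (not a learned network):
    uniform over the actions that earlier agents have not taken."""
    free = [a for a in range(n) if a not in prefix]
    return {a: 1.0 / len(free) for a in free}

def ar_joint_prob(joint_action, n):
    """Joint probability as the product of the per-agent conditionals."""
    prob = 1.0
    for i, a in enumerate(joint_action):
        prob *= ar_conditional(joint_action[:i], n).get(a, 0.0)
    return prob

def is_permutation(joint_action):
    return len(set(joint_action)) == len(joint_action)

n = 4
joint_actions = list(product(range(n), repeat=n))

# Expected reward of the AR policy and the joint actions it can produce.
ar_reward = sum(ar_joint_prob(a, n) for a in joint_actions if is_permutation(a))
modes = [a for a in joint_actions if ar_joint_prob(a, n) > 0]

# For comparison: fully independent uniform policies hit a permutation
# with probability n! / n^n.
uniform_reward = factorial(n) / n**n

print(ar_reward, len(modes), uniform_reward)
```

This AR policy spreads its probability over all $4!$ permutations and never selects a colliding joint action, whereas independent uniform policies collide most of the time.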

*Figure: Action heatmaps for policies learned by PG-Ind (left) and PG-AR (middle), and heatmap for rewards (right). While PG-Ind only converges to a specific mode in the 4-player permutation game, PG-AR successfully discovers all optimal modes.*

In more complex environments, including SMAC and GRF, PG-AR can learn interesting emergent behaviors that require strong inter-agent coordination and that may never be learned by PG-Ind.

*Figure 6: Emergent behavior induced by PG-AR in SMAC and GRF. (left) On SMAC’s 2m_vs_1z map, the marines stand still and attack alternately while ensuring that only one marine attacks at each timestep; (right) in GRF’s academy_3_vs_1_with_keeper scenario, agents learn a “Tiki-Taka” style behavior: each player keeps passing the ball to their teammates.*

## Discussions and takeaways

In this post, we provide a concrete analysis of VD and PG methods in cooperative MARL. First, we reveal the limited expressiveness of popular VD methods, showing that they cannot represent optimal policies even in a simple permutation game. In contrast, we show that PG methods are provably more expressive. We empirically verify the expressiveness advantage of PG on popular MARL benchmarks including SMAC, GRF, and the Hanabi Challenge. We hope that the insights from this work can help the community move toward more general and more powerful cooperative MARL algorithms in the future.

*This post is based on our paper: Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning (paper, website).*