
Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Shan Yang

Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all N agents jointly determine each agent's learning signal, so cross-agent noise grows with N. In the policy gradient setting, per-agent gradient estimate variance scales as Θ(N), yielding sample complexity O(N/ε). We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from Θ(N) to O(1), preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity O(1/ε). On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from N = 5 to N = 200 -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
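The abstract only sketches the mechanism, so the toy Python sketch below illustrates the core idea as described there: each agent receives a guidance gradient computed from a differentiable analytical system model rather than from the noisy shared reward, so its update does not depend on the other agents' sampled actions. The `system_cost` model, the regression-style policy update, and all hyperparameters are illustrative assumptions for this sketch, not the paper's actual algorithm or bounds.

```python
import torch

def system_cost(actions: torch.Tensor) -> torch.Tensor:
    """Hypothetical differentiable analytical model (NOT from the paper).

    Maps the joint action vector (one allocation per agent) to a scalar
    cost whose minimum is an "efficient system state", e.g. a balanced
    load across machines in a cloud-scheduling setting.
    """
    return ((actions - actions.mean()) ** 2).sum()

class AgentPolicy(torch.nn.Module):
    """Tiny Gaussian policy over a local observation (illustrative only)."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.mean = torch.nn.Linear(obs_dim, 1)
        self.log_std = torch.nn.Parameter(torch.zeros(1))

    def forward(self, obs: torch.Tensor):
        return self.mean(obs), self.log_std.exp()

def guidance_gradient_update(policies, observations, step=1e-2):
    """One guidance-gradient step in the spirit of the abstract.

    Each agent samples an action, the analytical model supplies the exact
    partial derivative of the system cost w.r.t. that agent's action, and
    the agent's policy mean is regressed toward the locally improved
    action. The guidance signal is computed analytically, so it carries
    no sampling noise from the other N-1 agents.
    """
    # Sample one action per agent (reparameterised, so the joint action
    # vector stays differentiable).
    actions = []
    for policy, obs in zip(policies, observations):
        mu, std = policy(obs)
        actions.append(mu + std * torch.randn_like(mu))
    joint = torch.cat(actions)  # shape (N,)

    # Noise-free per-agent guidance gradients from the analytical model.
    grads = torch.autograd.grad(system_cost(joint), joint)[0]

    # Per-agent update: move the sampled action one descent step along the
    # model gradient, then fit the policy mean to that guided action.
    for i, (policy, obs) in enumerate(zip(policies, observations)):
        guided = (joint[i] - step * grads[i]).detach()
        mu, _ = policy(obs)
        loss = (mu.squeeze() - guided) ** 2
        policy.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in policy.parameters():
                if p.grad is not None:
                    p -= step * p.grad

# Toy run: 4 agents, 3-dimensional local observations.
if __name__ == "__main__":
    torch.manual_seed(0)
    n_agents, obs_dim = 4, 3
    policies = [AgentPolicy(obs_dim) for _ in range(n_agents)]
    observations = [torch.randn(obs_dim) for _ in range(n_agents)]
    for _ in range(10):
        guidance_gradient_update(policies, observations)
```

The sketch only mirrors the decoupling idea: the per-agent signal comes from an analytical derivative rather than the shared stochastic return. The paper itself develops the guidance-gradient construction, the equilibrium-preservation proof, and the O(1) variance bound formally.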


Source: arXiv:2602.20078v1 (http://arxiv.org/abs/2602.20078v1)
PDF: https://arxiv.org/pdf/2602.20078v1

Submission: 2/25/2026
Subjects: AI; AI Agents