Mean-Field Reinforcement Learning without Synchrony
Abstract
Mean-field reinforcement learning (MF-RL) scales multi-agent RL to large populations by reducing each agent's dependence on others to a single summary statistic -- the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic -- one that remains defined regardless of which agents act. The population distribution -- the fraction of agents at each observation -- satisfies this requirement: its dimension is independent of the number of agents, and under exchangeability it fully determines each agent's reward and transition. Existing MF-RL theory, however, is built on the mean action and does not extend to the population distribution. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory. We prove existence and uniqueness of TMF equilibria, establish a finite-population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF-PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF-PG achieves near-identical performance whether one agent or all act per step, with approximation error decaying at the predicted rate.
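To make the contrast between the two summary statistics concrete, the sketch below (not from the paper; the function names, the discrete observation space, and the toy numbers are illustrative assumptions) computes the empirical population distribution -- a fixed-length vector of occupancy fractions that is defined no matter how many agents act -- alongside the mean action, which only exists over the subset of agents that happen to act in a given step.

```python
import numpy as np

def population_distribution(observations, num_obs):
    """Empirical population distribution: fraction of agents at each observation.

    Defined over the whole population regardless of which agents act this step,
    and its length (num_obs) does not grow with the number of agents.
    """
    counts = np.bincount(observations, minlength=num_obs)
    return counts / len(observations)

def mean_action(actions):
    """Mean action of the agents that acted this step.

    Undefined (here: None) when no agent acts, and computed over only a subset
    of the population otherwise -- the failure mode the abstract describes.
    """
    if len(actions) == 0:
        return None
    return np.mean(actions, axis=0)

# Toy example (hypothetical numbers): 1000 agents over 4 observations,
# but only 3 of them act in this step.
rng = np.random.default_rng(0)
obs = rng.integers(0, 4, size=1000)              # every agent has an observation
acts = rng.integers(0, 2, size=3)                # actions of the 3 acting agents

mu = population_distribution(obs, num_obs=4)     # always a length-4 vector
print(mu, mu.sum())                              # occupancy fractions, sum to 1
print(mean_action(acts))                         # reflects only the acting agents
```

The point of the example is only that `population_distribution` returns a vector of fixed dimension for any subset of acting agents, whereas `mean_action` degenerates as that subset shrinks; the TMF framework itself is developed in the body of the paper.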
Source: arXiv:2602.18026v1 - http://arxiv.org/abs/2602.18026v1 (PDF: https://arxiv.org/pdf/2602.18026v1)