How to optimize multi-agent RL for mean-variance objectives?
Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games
This paper proposes algorithms for optimizing the long-run mean-variance objective in team stochastic games (TSGs), a cooperative multi-agent setting where all agents share a common reward. Traditional dynamic programming fails here because variance is neither additive nor Markovian. The key innovation is applying sensitivity-based optimization to derive performance difference and performance derivative formulas, which yield a policy iteration algorithm (MV-MAPI) with a sequential, agent-by-agent update scheme that is guaranteed to converge to a first-order stationary point. Building on this, a multi-agent reinforcement learning algorithm (MV-MATRPO) based on trust-region optimization is proposed for model-free settings, allowing the approach to be applied in complex environments.

Relevant to LLM-based systems, these algorithms address cooperative multi-agent learning that must balance long-term performance against volatility, handle sequential decision-making, and scale to large state spaces through function approximation with neural networks. The sequential update within a centralized training framework offers a practical implementation route for complex multi-agent scenarios where exact policy iteration is intractable.
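The sketch below is not the authors' code; it is a minimal illustration, under assumed settings, of two ideas in the summary: a long-run mean-variance objective J = E[r] - beta * Var[r] over the shared team reward, and sequential per-agent policy-gradient updates with a crude KL-based trust-region check standing in for the full MV-MATRPO machinery. The payoff matrix, beta, step sizes, and the per-step variance surrogate are all illustrative assumptions.

```python
# Hypothetical sketch of mean-variance optimization in a 2-agent team game
# with sequential, trust-region-limited policy updates. Not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0                      # variance-aversion weight (assumed)
n_actions = 2
# Shared team reward for a repeated matrix game (assumed payoffs):
# one joint action has a higher mean reward but also higher variance.
reward_mean = np.array([[1.0, 0.2],
                        [0.2, 1.5]])
reward_std = np.array([[0.1, 0.1],
                       [0.1, 1.0]])

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def rollout(thetas, horizon=2000):
    """Sample joint actions and the shared reward under current policies."""
    pis = [softmax(th) for th in thetas]
    a1 = rng.choice(n_actions, size=horizon, p=pis[0])
    a2 = rng.choice(n_actions, size=horizon, p=pis[1])
    r = reward_mean[a1, a2] + reward_std[a1, a2] * rng.standard_normal(horizon)
    return (a1, a2), r

def mean_variance(r):
    return r.mean() - beta * r.var()

def surrogate_advantage(r):
    # Variance is non-additive, so use the per-step surrogate
    # r_t - beta * (r_t - mean)^2 as the learning signal (a common heuristic,
    # assumed here rather than taken from the paper's formulas).
    adv = r - beta * (r - r.mean()) ** 2
    return adv - adv.mean()

thetas = [np.zeros(n_actions), np.zeros(n_actions)]
lr, max_kl = 0.5, 0.05          # step size and trust-region radius (assumed)

for it in range(200):
    actions, r = rollout(thetas)
    adv = surrogate_advantage(r)
    # Sequential scheme: agent 0 updates first; agent 1 then updates against
    # the already-updated agent-0 policy using a fresh rollout.
    for i in range(2):
        if i == 1:
            actions, r = rollout(thetas)
            adv = surrogate_advantage(r)
        pi = softmax(thetas[i])
        a = np.asarray(actions[i])
        # REINFORCE-style gradient of the surrogate objective w.r.t. logits.
        onehot = np.eye(n_actions)[a]                      # (horizon, n_actions)
        grad = ((onehot - pi) * adv[:, None]).mean(axis=0)
        new_theta = thetas[i] + lr * grad
        # Crude trust-region check: reject the step if the policy KL is too big.
        new_pi = softmax(new_theta)
        if np.sum(pi * np.log(pi / new_pi)) <= max_kl:
            thetas[i] = new_theta

_, r = rollout(thetas)
print("final mean %.3f  var %.3f  J %.3f" % (r.mean(), r.var(), mean_variance(r)))
```

With the assumed payoffs, increasing beta should push the learned joint policy away from the high-mean, high-variance action pair, illustrating the mean-variance trade-off the algorithms are designed to control.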