Can Q-learning agents reliably cooperate?
Deterministic Model of Incremental Multi-Agent Boltzmann Q-Learning: Transient Cooperation, Metastability, and Oscillations
This paper examines how independent Q-learning agents using a Boltzmann exploration policy learn in simple repeated games such as the Prisoner's Dilemma. It finds that existing simplified models fail to capture the dynamics of the actual algorithm, and it introduces a new, more accurate model that accounts for how frequently each agent's Q-values are updated. This model reveals that apparent long-term cooperation is often a metastable, exploitable phase rather than a true equilibrium. Furthermore, a high discount factor (strongly weighting future rewards) can produce persistent oscillations: because each agent is a moving target for the other, the agents may never settle on a stable joint policy.
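As a rough sketch of the setup the paper studies, the code below runs two independent Boltzmann Q-learning agents on the repeated Prisoner's Dilemma. The payoff values, learning rate, temperature, and discount factor here are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

# Illustrative Prisoner's Dilemma payoffs for the row player
# (action 0 = cooperate, 1 = defect). These values and the
# hyperparameters below are assumptions, not the paper's setup.
PAYOFF = np.array([[3.0, 0.0],   # I cooperate: opponent C -> 3, opponent D -> 0
                   [5.0, 1.0]])  # I defect:    opponent C -> 5, opponent D -> 1

def boltzmann(q, tau):
    """Softmax (Boltzmann) action probabilities at temperature tau."""
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def run(steps=50_000, alpha=0.05, gamma=0.9, tau=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Stateless repeated game: each agent keeps one Q-value per action.
    q = [np.zeros(2), np.zeros(2)]
    coop = []
    for _ in range(steps):
        acts = [rng.choice(2, p=boltzmann(q[i], tau)) for i in range(2)]
        rewards = (PAYOFF[acts[0], acts[1]], PAYOFF[acts[1], acts[0]])
        for i in range(2):
            # Incremental Q-update. The bootstrap term gamma * max(Q) uses the
            # agent's own evolving estimate, which couples the two learners
            # and creates the "moving target" effect discussed above.
            target = rewards[i] + gamma * q[i].max()
            q[i][acts[i]] += alpha * (target - q[i][acts[i]])
        coop.append(1.0 - np.mean(acts))  # fraction of agents cooperating this step
    return q, float(np.mean(coop[-1000:]))

if __name__ == "__main__":
    q_values, recent_coop = run()
    print("final Q-values per agent:", q_values)
    print("cooperation rate over last 1,000 steps:", recent_coop)
```

In toy runs like this, the joint behavior is quite sensitive to the temperature and discount factor; with a high gamma, runs can show long cooperative-looking stretches before drifting away, which is the kind of transient behavior the paper analyzes.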
Key points for LLM-based multi-agent systems:
- Simplified models can be misleading: Commonly used approximations of multi-agent learning dynamics do not accurately reflect the behavior of standard algorithms such as independent Q-learning, which limits their value for understanding and predicting agent behavior.
- Metastability vs. true equilibrium: LLM-based agents might appear to cooperate for extended periods, but this can be a metastable, exploitable phase rather than stable learned behavior.
- Oscillations and the moving target problem: In multi-agent systems, agents constantly adapt to each other, so each learner faces a moving target. This can destabilize learning and lead to oscillations, especially when agents place high value on future rewards (a high discount factor), preventing convergence to a stable outcome. These challenges are likely amplified for LLM-based agents, whose interactions are complex and highly non-stationary.
- Importance of update frequency: How often each agent updates its strategy strongly shapes the joint dynamics and can produce unexpected behavior, so update schedules deserve explicit attention when designing LLM-based multi-agent applications (a minimal illustration follows this list).
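To make the update-frequency point concrete, here is a minimal variation of the earlier sketch in which each agent applies its Q-update only with some probability per step, so the two learners adapt at different effective rates. The update probabilities and other parameters are again illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Payoffs and softmax repeated from the sketch above so this snippet is
# self-contained; the values remain illustrative assumptions.
PAYOFF = np.array([[3.0, 0.0], [5.0, 1.0]])

def boltzmann(q, tau):
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def run_async(steps=50_000, alpha=0.05, gamma=0.95, tau=0.5,
              update_prob=(1.0, 0.2), seed=0):
    """Agent i applies its Q-update only with probability update_prob[i]
    per step, so the two learners adapt at different effective rates."""
    rng = np.random.default_rng(seed)
    q = [np.zeros(2), np.zeros(2)]
    for _ in range(steps):
        acts = [rng.choice(2, p=boltzmann(q[i], tau)) for i in range(2)]
        rewards = (PAYOFF[acts[0], acts[1]], PAYOFF[acts[1], acts[0]])
        for i in range(2):
            if rng.random() < update_prob[i]:  # skip updates to mimic a slower learner
                target = rewards[i] + gamma * q[i].max()
                q[i][acts[i]] += alpha * (target - q[i][acts[i]])
    return q

if __name__ == "__main__":
    print(run_async())
```

Comparing runs with symmetric and asymmetric `update_prob` (and sweeping `gamma`) gives a quick feel for how update schedules and discounting change the joint dynamics; the analogy to LLM agents that revise their strategies at different cadences is loose but suggestive.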