How can I fairly rate LLMs in multi-agent games?
DEVIATION RATINGS: A GENERAL, CLONE INVARIANT RATING METHOD
February 18, 2025
https://arxiv.org/pdf/2502.11645

This paper introduces deviation ratings, a new method for evaluating the performance of multiple AI agents (or strategies) interacting strategically. It addresses a weakness of traditional rating systems: scores can be skewed when the pool contains many similar strategies, a common situation in modern LLM evaluations.
For LLM-based multi-agent systems, deviation ratings offer several benefits:
- Clone Invariance: Adding copies of existing agents doesn't change the ratings, making the evaluation robust to redundant data and preventing manipulation through the submission of many similar prompts or models.
- N-player General-Sum Applicability: Unlike simpler methods like Elo, deviation ratings work for any number of agents and in scenarios where interactions aren't strictly competitive (e.g., cooperative or mixed-motive situations). This is highly relevant to complex LLM interactions, such as those involving multiple models and prompts.
- Focus on Distinguishing Tasks: The method identifies tasks that are most effective at differentiating between top-performing models, providing valuable insights for dataset curation and targeted model improvement.
- Promoting Holistic Improvement: By focusing on the strictest equilibrium in a multi-agent setting, deviation ratings encourage the development of more robust and generally capable LLMs rather than those over-specialized in narrow areas.
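The clone-invariance point above is easiest to see by contrast with naive averaging. The following is a minimal sketch (the payoff matrix, the `average_payoff_rating` function, and the averaging rule are illustrative choices, not taken from the paper): a uniform-average rating over a symmetric cyclic game rates all strategies equally, but duplicating one strategy shifts the scores of the others.

```python
import numpy as np

# Toy symmetric zero-sum game: payoff[i, j] is the payoff to row strategy i
# against column strategy j. A beats B, B beats C, C beats A.
payoff = np.array([
    [0.0,  1.0, -1.0],   # A
    [-1.0, 0.0,  1.0],   # B
    [1.0, -1.0,  0.0],   # C
])

def average_payoff_rating(p):
    # Naive rating: mean payoff against a uniform pool of opponents.
    return p.mean(axis=1)

# In the symmetric 3-strategy game, every strategy rates exactly 0.
print(average_payoff_rating(payoff))

# Now add a clone of strategy C (an identical fourth row and column).
cloned = np.array([
    [0.0,  1.0, -1.0, -1.0],  # A
    [-1.0, 0.0,  1.0,  1.0],  # B
    [1.0, -1.0,  0.0,  0.0],  # C
    [1.0, -1.0,  0.0,  0.0],  # C clone
])

# A's average drops and B's rises, even though no new behavior was added --
# exactly the distortion a clone-invariant rating is designed to prevent.
print(average_payoff_rating(cloned))
```

Deviation ratings avoid this by rating strategies against an equilibrium of the underlying game rather than against a uniform average of the pool, so redundant entries carry no extra weight.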