How can agents be ranked using noisy performance data?
Soft Condorcet Optimization for Ranking of General Agents
This paper introduces Soft Condorcet Optimization (SCO), a new method for ranking AI agents based on their performance across different tasks or against each other. It addresses the incomplete data common in agent evaluation by using a differentiable loss function inspired by voting theory: observed performance comparisons are treated as votes, and SCO seeks the ranking that minimizes disagreements with those votes.
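To make the idea concrete, here is a minimal sketch of one way such a differentiable, vote-based loss could be optimized: each agent gets a scalar rating, each pairwise comparison contributes a sigmoid penalty when the ratings disagree with it, and plain gradient descent fits the ratings. The function name `sco_ratings`, the temperature `tau`, and the toy votes are illustrative assumptions, not the paper's exact formulation.

```python
import math

def sco_ratings(votes, agents, steps=2000, lr=0.5, tau=1.0):
    """Fit one rating per agent by gradient descent on a soft Condorcet-style loss.

    votes: list of (winner, loser) pairs observed during evaluation.
    Each vote adds sigmoid((r[loser] - r[winner]) / tau) to the loss, so a
    vote the current ratings agree with contributes ~0 and a vote they
    contradict contributes ~1; as tau -> 0 this approaches a hard count
    of disagreements with the observed comparisons.
    """
    r = {a: 0.0 for a in agents}
    for _ in range(steps):
        grad = {a: 0.0 for a in agents}
        for w, l in votes:
            s = 1.0 / (1.0 + math.exp((r[w] - r[l]) / tau))  # this vote's loss term
            g = s * (1.0 - s) / tau  # magnitude of its gradient
            grad[w] -= g             # push the winner's rating up
            grad[l] += g             # push the loser's rating down
        for a in agents:
            r[a] -= lr * grad[a]
    return r

# Toy, incomplete data: A usually beats B, B usually beats C, one upset C > A.
votes = [("A", "B")] * 3 + [("B", "C")] * 3 + [("C", "A")]
ratings = sco_ratings(votes, ["A", "B", "C"])
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Note that the comparisons need not form a complete or consistent tournament; the optimization simply settles on the ordering that disagrees with as few votes as possible (here, A > B > C disagrees only with the single upset).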
For LLM-based multi-agent systems, SCO offers a way to rank and compare LLMs based on their performance across diverse benchmarks, even with incomplete data. It can potentially identify the best-performing LLM in a group by aggregating results from various tasks, similar to how voting systems elect a winner. The method is differentiable, which may prove useful for training or fine-tuning LLMs based on their relative performance rankings. It also offers an online version for updating rankings as new evaluation data becomes available, which could be beneficial for continuously evolving LLM agents.
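The online variant mentioned above can be sketched as a single stochastic-gradient step per incoming comparison, applied to the same sigmoid loss as before; the resulting update resembles an Elo-style rating adjustment. The helper `online_update` and its parameters are illustrative assumptions, not the paper's exact rule.

```python
import math

def online_update(ratings, winner, loser, lr=0.1, tau=1.0):
    """One SGD step on a soft Condorcet-style loss for a single new vote.

    ratings: dict mapping agent name -> current scalar rating (updated in place).
    The step size shrinks as the ratings already agree with the vote,
    so confirmed expectations barely move the ranking while upsets do.
    """
    s = 1.0 / (1.0 + math.exp((ratings[winner] - ratings[loser]) / tau))
    g = lr * s * (1.0 - s) / tau
    ratings[winner] += g
    ratings[loser] -= g
    return ratings

# New evaluation results stream in and the ranking updates immediately.
r = {"gpt_a": 0.0, "gpt_b": 0.0}
online_update(r, "gpt_a", "gpt_b")
```

This incremental form is what makes the approach attractive for continuously evolving agents: there is no need to re-fit the full ranking from scratch when a new benchmark result arrives.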