How can LLMs debate to evaluate each other?
ADVERSARIAL MULTI-AGENT EVALUATION OF LARGE LANGUAGE MODELS THROUGH ITERATIVE DEBATES
October 8, 2024
https://arxiv.org/pdf/2410.04663
This paper proposes a new way to evaluate how well large language models (LLMs) perform, using a multi-agent system inspired by courtrooms.
- Instead of relying only on humans or simple metrics, LLMs act as advocates, judges, and juries to debate and assess the quality of other LLMs' outputs.
- Two architectures are explored: MORE (multiple advocates, one round) and SAMRE (single advocate, multiple rounds with judge feedback), both showing promising results in experimental evaluations; a minimal sketch of the multi-round debate loop follows below.
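To make the iterative-debate idea concrete, here is a minimal sketch of a SAMRE-style loop: one advocate per candidate answer, multiple rounds, with the judge's feedback fed back into each round. This is an illustrative assumption of how such a pipeline could be wired, not the paper's exact prompts or scoring scheme; `call`-style `LLM` functions, the prompt templates, and the round count are all placeholders.

```python
# Hypothetical sketch of a single-advocate, multi-round debate evaluation.
# Any chat-completion client can stand in for the LLM callables; the prompts
# and verdict format here are illustrative, not taken from the paper.
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # takes a prompt string, returns the model's reply


def debate_evaluate(
    question: str,
    answer_a: str,
    answer_b: str,
    advocate_a: LLM,
    advocate_b: LLM,
    judge: LLM,
    rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Run an iterative debate between two advocates and return the judge's
    final verdict plus the per-round feedback transcript."""
    transcript: List[str] = []
    feedback = "No feedback yet."

    for r in range(1, rounds + 1):
        # Each advocate defends its assigned answer, conditioning on the
        # judge's feedback from the previous round.
        arg_a = advocate_a(
            f"Question: {question}\nDefend answer A: {answer_a}\n"
            f"Judge feedback so far: {feedback}\nGive your strongest argument."
        )
        arg_b = advocate_b(
            f"Question: {question}\nDefend answer B: {answer_b}\n"
            f"Judge feedback so far: {feedback}\nGive your strongest argument."
        )
        # The judge critiques both arguments; this feedback drives the next round.
        feedback = judge(
            f"Question: {question}\nArgument for A: {arg_a}\n"
            f"Argument for B: {arg_b}\n"
            "Point out weaknesses in each argument and note missing evidence."
        )
        transcript.append(f"Round {r} feedback: {feedback}")

    # After the final round, the judge issues a verdict on which answer won.
    verdict = judge(
        f"Question: {question}\nDebate feedback across rounds:\n"
        + "\n".join(transcript)
        + "\nWhich answer is better overall? Reply with exactly 'A' or 'B'."
    )
    return verdict.strip(), transcript
```

A MORE-style variant would instead call several advocates per answer in a single round and aggregate their arguments before judging; a jury could replace the single judge by sampling multiple judge calls and taking a majority vote.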