How can we reliably evaluate LLMs without ground truth?
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
This paper introduces SAGEval, a framework for evaluating open-ended, reference-free text generated by LLMs, with a focus on complex formats like surveys and forms. It uses a two-agent system: an initial LLM evaluator scores the text against predefined criteria, and a second "SAGE" agent critiques those scores, suggesting adjustments and even proposing new evaluation criteria. This approach aims to improve automatic evaluation by mimicking expert review processes and reducing reliance on labeled reference data, which is often unavailable in real-world applications. Central to the framework is the SAGE agent's ability to refine the initial evaluation, demonstrating a capacity for meta-evaluation and adaptation to open-ended text. Compared with single-agent evaluation, this multi-agent setup yields a more nuanced, human-aligned assessment; a sketch of the loop follows below.
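To make the two-agent flow concrete, here is a minimal Python sketch of the evaluate-then-critique loop. The function names (`evaluator_agent`, `sage_agent`, `sageval`), the placeholder criteria, the prompts, and the JSON response schema are all illustrative assumptions, not the paper's actual prompts or scoring rubric; the `llm` argument stands in for whatever completion API you use.

```python
import json

# Placeholder criteria for illustration; the paper defines its own rubric.
DEFAULT_CRITERIA = ["accuracy", "coherence", "relevance"]


def evaluator_agent(llm, text, criteria):
    """First agent: score the text against predefined criteria."""
    prompt = (
        "Score the following text from 1-5 on each criterion. "
        'Return JSON like {"accuracy": 4, ...}.\n'
        f"Criteria: {', '.join(criteria)}\n"
        f"Text:\n{text}"
    )
    return json.loads(llm(prompt))


def sage_agent(llm, text, scores):
    """Second agent: critique the scores, adjust them, and propose new criteria."""
    prompt = (
        "You are a critic reviewing another evaluator's scores. "
        "Adjust any score you disagree with, explain why, and propose "
        "additional criteria if the existing ones miss something. "
        'Return JSON: {"adjusted_scores": {...}, "critique": "...", '
        '"new_criteria": [...]}.\n'
        f"Scores: {json.dumps(scores)}\n"
        f"Text:\n{text}"
    )
    return json.loads(llm(prompt))


def sageval(llm, text, criteria=DEFAULT_CRITERIA):
    """Run the two-stage evaluation: initial scoring, then SAGE critique."""
    initial = evaluator_agent(llm, text, criteria)
    review = sage_agent(llm, text, initial)
    return {"initial_scores": initial, **review}


# Usage with a dummy LLM so the sketch runs end to end without an API key.
def dummy_llm(prompt):
    if "critic" in prompt:  # crude routing for the demo only
        return ('{"adjusted_scores": {"accuracy": 3}, '
                '"critique": "Overstated factual accuracy.", '
                '"new_criteria": ["tone"]}')
    return '{"accuracy": 4, "coherence": 5, "relevance": 4}'


print(sageval(dummy_llm, "Q1: How satisfied are you with our service?"))
```

The key design point this sketch captures is that the SAGE agent sees both the original text and the first agent's scores, so its output is a meta-evaluation rather than an independent second opinion.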