Can AI agents reliably reproduce scientific research?
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
September 18, 2024
https://arxiv.org/pdf/2409.11363

This paper introduces CORE-Bench, a benchmark designed to evaluate the ability of AI agents to automatically reproduce the results of scientific papers. The benchmark focuses on computational reproducibility, i.e., obtaining a paper's reported results from the code and data its authors released, a crucial aspect of credible scientific research.
Key points relevant to LLM-based multi-agent systems:
- The paper investigates whether general-purpose LLM-based agents, such as AutoGPT, can be adapted to the specialized task of computational reproducibility.
- Findings indicate that task-specific modifications to LLM-based agents significantly enhance their accuracy in reproducing research results (see the sketch after this list).
- Despite the gains from task-specific prompting, the best-performing agent still achieved only 21% accuracy on the benchmark's hardest tasks, highlighting the need for further research in this area.
- The paper emphasizes the potential of adaptable, LLM-based agents to automate complex scientific tasks and, ultimately, to help automate scientific research itself.
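To make the "task-specific modifications" finding concrete, here is a minimal sketch of how a generic agent might be wrapped with reproduction-specific instructions and then graded against a paper's reported values. The `run_agent` stub, the `REPRO_INSTRUCTIONS` text, and the tolerance-based grading are all assumptions for illustration; they do not reflect CORE-Bench's actual harness or scoring rules.

```python
# Sketch of a reproducibility evaluation loop. The agent interface
# (run_agent) and the tolerance-based grading are hypothetical
# illustrations, not the CORE-Bench implementation.

REPRO_INSTRUCTIONS = """\
You are reproducing the computational results of a scientific paper.
Inspect the repository, install its dependencies, run the analysis
end to end, and report the requested quantities as JSON.
"""

def run_agent(repo_dir: str, instructions: str) -> dict[str, float]:
    """Placeholder for a general-purpose agent (e.g. AutoGPT)
    driven by task-specific instructions. Stubbed here so the
    sketch runs without an LLM backend."""
    return {"test_accuracy": 0.912}

def score_task(reported: dict[str, float],
               expected: dict[str, float],
               rel_tol: float = 0.05) -> bool:
    """Count a task as reproduced only if every queried value
    matches the paper's reported value within a relative tolerance."""
    return all(
        key in reported
        and abs(reported[key] - value) <= rel_tol * abs(value)
        for key, value in expected.items()
    )

report = run_agent("tasks/example-paper", REPRO_INSTRUCTIONS)
print(score_task(report, {"test_accuracy": 0.915}))  # True: within 5%
```

A tolerance-based check, rather than exact matching, is used here because re-running scientific code often yields small numeric differences (for example, from nondeterministic training or library versions), so exact equality would under-count successful reproductions.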