Can LLMs reliably code algorithms from research papers?
SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
April 2, 2025
https://arxiv.org/pdf/2504.00255

This paper introduces SciReplicate-Bench, a benchmark that tests how well LLMs can reproduce code from the algorithm descriptions in research papers. It also presents Sci-Reproducer, a multi-agent system in which a "Paper Agent" extracts the algorithm's description and surrounding context from the paper, and a "Code Agent" searches the accompanying code repository and implements the algorithm. The key findings: current LLMs struggle with this task, often overthinking and failing to use the available tools effectively, which highlights the need for better integration of external knowledge and code comprehension in multi-agent LLM systems.
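To make the two-agent split concrete, here is a minimal sketch of how such a pipeline might be wired up. Every name here (LLMClient, PaperAgent, CodeAgent, reproduce) is a hypothetical illustration of the architecture described above, not the paper's actual code: the LLM is modeled as a plain prompt-to-completion callable, and repository search is a naive keyword scan standing in for real code-search tools.

```python
# Illustrative sketch of a two-agent reproduction pipeline in the spirit of
# Sci-Reproducer. All names are hypothetical, not the paper's actual API.
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# An LLM client is modeled as a plain callable: prompt in, completion out.
LLMClient = Callable[[str], str]


@dataclass
class PaperAgent:
    """Extracts the algorithm description and referenced context from the paper."""
    llm: LLMClient

    def extract_algorithm(self, paper_text: str, section: str) -> str:
        prompt = (
            "Extract the algorithm described in this section, resolving "
            f"references to equations and notation.\n\nSection:\n{section}\n\n"
            f"Paper context:\n{paper_text[:4000]}"
        )
        return self.llm(prompt)


@dataclass
class CodeAgent:
    """Searches the repository for relevant code, then drafts an implementation."""
    llm: LLMClient
    repo_root: Path

    def search_repo(self, query: str, max_files: int = 3) -> str:
        """Naive keyword scan over Python files (stand-in for real code search)."""
        hits = []
        for path in self.repo_root.rglob("*.py"):
            text = path.read_text(errors="ignore")
            if query.lower() in text.lower():
                hits.append(f"# {path}\n{text[:1000]}")
            if len(hits) >= max_files:
                break
        return "\n\n".join(hits)

    def implement(self, algorithm: str, repo_context: str) -> str:
        prompt = (
            "Implement the following algorithm as a Python function, reusing "
            "the repository utilities shown below where possible.\n\n"
            f"Algorithm:\n{algorithm}\n\nRepository context:\n{repo_context}"
        )
        return self.llm(prompt)


def reproduce(llm: LLMClient, paper_text: str, section: str, repo_root: Path) -> str:
    """End-to-end pass: paper comprehension first, then code generation."""
    paper_agent = PaperAgent(llm)
    code_agent = CodeAgent(llm, repo_root)
    algorithm = paper_agent.extract_algorithm(paper_text, section)
    context = code_agent.search_repo(query=algorithm.split("\n")[0])
    return code_agent.implement(algorithm, context)


if __name__ == "__main__":
    # Stub LLM for a dry run; swap in a real model client in practice.
    echo_llm: LLMClient = lambda prompt: f"# response to: {prompt[:60]}..."
    print(reproduce(echo_llm, paper_text="...", section="Algorithm 1 ...",
                    repo_root=Path(".")))
```

The paper's finding that models fail to use their tools effectively maps onto the `search_repo` step here: if the Code Agent never grounds its draft in the repository's existing utilities, it tends to reinvent (or hallucinate) them.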