How can LLMs best collaborate in complex tasks?
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
February 28, 2025
https://arxiv.org/pdf/2502.20073

This paper introduces Collab-Overcooked, a benchmark for evaluating how well large language models (LLMs) can work together as agents in a multi-agent system. It uses a modified version of the Overcooked-AI game in which two LLM agents, a chef and an assistant, must communicate and coordinate to complete cooking tasks.
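As a rough sketch of the interaction pattern this setup implies, the loop below alternates the two agents between observing their own isolated kitchen, exchanging natural-language messages, and acting. Every name here (the env methods, query_llm, the SAY: convention) is a hypothetical illustration, not the benchmark's actual API:

```python
# Hypothetical two-agent loop; none of these names come from the benchmark.

def query_llm(role: str, system_prompt: str, context: list[str]) -> str:
    """Stand-in for a call to any chat-completion LLM."""
    raise NotImplementedError  # e.g., an OpenAI or local-model chat call

def run_episode(env, prompts: dict[str, str], max_turns: int = 30) -> bool:
    """Alternate turns between the chef and the assistant. Each agent sees
    only its own observation plus the shared message log, then either
    speaks (SAY: ...) or executes a primitive action in its own kitchen."""
    messages: list[str] = []  # the natural-language channel between agents
    for _ in range(max_turns):
        for role in ("chef", "assistant"):
            obs = env.observe(role)  # isolated: each agent sees only its side
            reply = query_llm(role, prompts[role], messages + [obs])
            if reply.startswith("SAY:"):
                messages.append(f"{role}: {reply[4:].strip()}")  # communicate
            else:
                env.step(role, reply)  # execute a primitive action
        if env.task_complete():
            return True
    return False
```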
Key points for LLM-based multi-agent systems:
- Collaboration definition: The benchmark defines collaboration as the ability to both initiate requests for help and respond effectively to those requests.
- Forced Collaboration and Asymmetric Knowledge: The agents operate in isolated environments with different capabilities, and only one of them knows the recipe, so they must collaborate through natural language (a sketch of this setup follows this list).
- Process-Oriented Evaluation: Instead of only measuring task completion, Collab-Overcooked introduces new metrics (TES and ITES) that assess the collaboration process itself, scoring the efficiency and correctness of individual actions within a task sequence (a toy scoring example follows this list).
- Scalability and Bottlenecks: Experiments with various LLMs show that while larger models generally perform better, all models struggle as tasks grow more complex, exposing a bottleneck in maintaining consistent collaboration and adapting to dynamic situations. Initiating collaboration proves harder than responding to it.
- Task Decomposition and Context Tracking: Task decomposition influences performance but does not fully explain the decline in collaboration as tasks get harder. LLMs also exhibit a strong positional dependence in their ability to execute action sequences, suggesting limited context-tracking abilities.
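To make the knowledge asymmetry concrete, here is one hedged way it could be encoded: only the chef's system prompt contains the recipe, so the assistant can make progress only by initiating requests. The recipe string and prompt wording are invented for illustration and plug into the `prompts` dict used in the loop sketch above:

```python
# Invented example of asymmetric prompts; not the benchmark's actual text.
RECIPE = "chop tomato -> cook soup -> plate soup"

prompts = {
    "chef": (
        "You are the chef. You know the recipe below but cannot reach the "
        "assistant's counter, so delegate steps and answer requests.\n"
        f"Recipe: {RECIPE}"
    ),
    "assistant": (
        "You are the assistant. You do NOT know the recipe. When unsure what "
        "to do next, ask the chef (prefix your message with 'SAY:'); "
        "otherwise carry out the last instruction you received."
    ),
}
```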
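And as a toy version of process-oriented scoring: the function below grades a trajectory by how many of its actions advance a reference optimal sequence. This captures the spirit of evaluating the process rather than just the outcome, but it is not the paper's exact TES/ITES formulation:

```python
# Toy process-oriented score; NOT the paper's exact TES/ITES formulas.

def process_efficiency(executed: list[str], reference: list[str]) -> float:
    """Score a trajectory by how far it advances a reference optimal
    sequence, penalizing redundant or out-of-order actions."""
    i = 0          # pointer into the reference sequence
    effective = 0  # actions that made real progress
    for action in executed:
        if i < len(reference) and action == reference[i]:
            effective += 1
            i += 1
        # otherwise: a wasted or incorrect action contributes nothing
    if not executed:
        return 0.0
    completion = i / len(reference)       # how much of the task got done
    efficiency = effective / len(executed)  # how little effort was wasted
    return completion * efficiency        # 1.0 only for a perfect minimal run

# Example: one redundant "chop" drops the score below 1.0.
ref = ["chop", "cook", "plate"]
print(process_efficiency(["chop", "chop", "cook", "plate"], ref))  # 0.75
```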