How can we measure AI-human teamwork effectiveness?
Who is Helping Whom? Analyzing Inter-dependencies to Evaluate Cooperation in Human-AI Teaming
This paper explores how to effectively measure cooperation in human-AI teams, especially when the AI agent is trained with Multi-Agent Reinforcement Learning (MARL). Most existing metrics focus on task completion and ignore how the team actually interacts. The researchers propose "interdependence", the degree to which agents rely on each other's actions, as a better measure of cooperation. They use a symbolic representation of the environment and actions to track and quantify these interdependencies. Experiments in the Overcooked game environment, pairing state-of-the-art MARL agents with a learned human model, reveal that these agents do not cooperate effectively with humans even when they achieve high task performance. This points to a misalignment between current MARL training objectives and genuine cooperation, and suggests a need for new training paradigms and richer environments for evaluating multi-agent systems.
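The paper's exact formalism is not reproduced here, but the core idea can be illustrated with a rough sketch: imagine a symbolic trace in which each action records which agent established the facts (preconditions) it relied on, and measure interdependence as how often one agent's actions depend on facts contributed by the other. The names and data structures below (SymbolicAction, interdependence) are hypothetical and only meant to convey the intuition, not the paper's actual method.

```python
from dataclasses import dataclass

# Hypothetical symbolic trace: each action records which agent last
# established each fact (precondition) it relied on. Illustrative only.

@dataclass
class SymbolicAction:
    agent: str          # agent executing the action, e.g. "ai" or "human"
    name: str           # e.g. "deliver_soup"
    preconditions: dict # fact -> agent that last established it

def interdependence(trace, agent, partner):
    """Fraction of `agent`'s actions that relied on at least one
    precondition established by `partner` (a simple proxy for how much
    `agent` depended on `partner`)."""
    actions = [a for a in trace if a.agent == agent]
    if not actions:
        return 0.0
    dependent = sum(
        1 for a in actions
        if any(src == partner for src in a.preconditions.values())
    )
    return dependent / len(actions)

# Toy Overcooked-style trace: the human supplies an onion, the AI cooks
# and delivers the soup.
trace = [
    SymbolicAction("human", "put_onion_in_pot", {"onion_available": "human"}),
    SymbolicAction("ai", "cook_soup", {"pot_has_onion": "human"}),
    SymbolicAction("ai", "deliver_soup", {"soup_ready": "ai"}),
]

print(interdependence(trace, "ai", "human"))  # 0.5: half the AI's actions relied on the human
print(interdependence(trace, "human", "ai"))  # 0.0: the human never relied on the AI
```

In this toy trace, half of the AI's actions depend on facts the human established while the human never depends on the AI; that kind of asymmetry ("who is helping whom") is exactly what the paper's analysis is designed to surface, independently of the team's score.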
Key points for LLM-based multi-agent systems: This work emphasizes the need for evaluation metrics that go beyond simple task success, a need that is especially pressing for LLM agents. The concept of "interdependence" offers a valuable lens for assessing how LLM agents collaborate. The use of symbolic representations to analyze agent interactions could be adapted to LLMs, providing insight into the reasoning behind cooperative or non-cooperative behavior. Finally, the finding that current MARL training does not by itself foster true cooperation is a caution that applies equally to LLM-based multi-agent systems, and it motivates research on training paradigms that explicitly incentivize interdependent behavior among LLM agents.
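As a thought experiment (not something the paper implements), a similar measurement could be layered on top of an LLM multi-agent transcript, for example by having a parser or judge model tag which earlier agent outputs each step builds on and then counting cross-agent dependencies. The transcript format and function below are hypothetical.

```python
from collections import defaultdict

# Hypothetical transcript of an LLM multi-agent run: each step records the
# acting agent and the ids of earlier steps its output builds on (as tagged
# by a parser or judge model). Illustrative only.
transcript = [
    {"id": 0, "agent": "planner", "uses": []},
    {"id": 1, "agent": "coder",   "uses": [0]},  # implements the plan
    {"id": 2, "agent": "coder",   "uses": []},   # independent refactor
    {"id": 3, "agent": "planner", "uses": [1]},  # revises plan after seeing code
]

def dependence_matrix(transcript):
    """Count how often each agent's step builds on another agent's step."""
    by_id = {step["id"]: step for step in transcript}
    counts = defaultdict(int)
    for step in transcript:
        for ref in step["uses"]:
            src = by_id[ref]["agent"]
            if src != step["agent"]:
                counts[(step["agent"], src)] += 1
    return dict(counts)

print(dependence_matrix(transcript))
# {('coder', 'planner'): 1, ('planner', 'coder'): 1}
```

A run where these counts are near zero in one direction would hint that one agent is doing all the work, regardless of whether the overall task succeeds.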