How can I efficiently learn multi-agent rewards from demos?
Multi-Agent Inverse Q-Learning from Demonstrations
This paper introduces MAMQL (Multi-Agent Marginal Q-Learning from Demonstrations), a new algorithm for training agents in multi-agent settings that mix cooperation and competition. It addresses the inverse reinforcement learning problem of inferring what motivates each agent from expert demonstrations alone, even when those motivations are complex and intertwined. Instead of training agents to directly imitate expert actions, MAMQL learns the underlying reward functions that drive expert decision-making. It does so by accounting for how each agent's actions affect the others and by learning, for each agent, a simplified "marginalized" view of the environment that averages out the other agents' behavior. A minimal sketch of this marginalization idea appears below.
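To make the marginalization idea concrete, here is a minimal tabular sketch, not the paper's implementation: for one agent, a joint Q-table is averaged over a model of the other agent's policy to get a per-agent marginal Q, and a soft inverse Bellman step then recovers a reward estimate from it. The names (Q_joint, pi_other), the uniform next-state assumption, and the log-sum-exp value are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy setup: 2 agents, tabular joint Q-function for agent i.
# Q_joint[s, a_i, a_j] is agent i's joint action value; pi_other[s, a_j] is an
# assumed model of the other agent's policy. Both are random here for illustration.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
gamma = 0.9

Q_joint = rng.normal(size=(n_states, n_actions, n_actions))
pi_other = rng.dirichlet(np.ones(n_actions), size=n_states)

def marginal_q(Q_joint, pi_other):
    """Marginalize the joint Q over the other agent's actions:
    Q_i(s, a_i) = E_{a_j ~ pi_other}[Q(s, a_i, a_j)]."""
    return np.einsum('sij,sj->si', Q_joint, pi_other)

def soft_value(Q_i):
    """Soft (log-sum-exp) state value implied by the marginal Q."""
    return np.log(np.exp(Q_i).sum(axis=1))

def recover_reward(Q_i, V_next, gamma):
    """Inverse soft Bellman step: r(s, a_i) = Q_i(s, a_i) - gamma * E[V(s')].
    V_next[s, a_i] stands in for the expected next-state value."""
    return Q_i - gamma * V_next

Q_i = marginal_q(Q_joint, pi_other)                 # shape (n_states, n_actions)
V = soft_value(Q_i)                                 # shape (n_states,)
# Toy transition model: assume every (s, a_i) leads to a uniformly random state.
V_next = np.full((n_states, n_actions), V.mean())
r_hat = recover_reward(Q_i, V_next, gamma)
print(r_hat.shape)  # (5, 3): one recovered reward per (state, own-action) pair
```

In the actual algorithm these quantities would be learned from demonstrations with function approximation rather than fixed tables; the sketch only shows how marginalizing over the other agents reduces each agent's problem to a single-agent-style Q-function and reward.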
Key points relevant to LLM-based multi-agent systems include the ability to learn from expert demonstrations, the handling of mixed cooperative/competitive scenarios, and the potential to scale to more complex environments with more agents. Reward-function recovery could let LLM agents infer complex objectives and conventions from demonstration data, which would be valuable for building more robust and adaptable multi-agent systems. The marginalization technique also offers a potential avenue for managing the computational complexity of coordinating multiple LLM agents.