How can cooperative agents be trained offline without drifting from their training data?
COMADICE: OFFLINE COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING WITH STATIONARY DISTRIBUTION SHIFT REGULARIZATION
This research paper addresses the challenges of applying offline reinforcement learning (learning from fixed datasets) to multi-agent systems, particularly in scenarios where agents must cooperate. The core problem is distribution shift: the learned policy can drift away from the state-action distribution covered by the dataset, so its value estimates become unreliable and performance degrades.
The paper introduces ComaDICE, a novel algorithm that uses stationary distribution correction to keep the learned policy consistent with the training data. This is particularly relevant for LLM-based multi-agent systems: developers can train agents on a fixed dataset of text interactions while preventing them from generating outputs that deviate too far from the style or content of that data. ComaDICE also uses a value decomposition strategy that factors the joint multi-agent objective into smaller per-agent sub-problems, which keeps training tractable as the number of agents grows and makes the approach practical even when each agent is a computationally expensive model such as a large language model.
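To make these two ideas concrete, below is a minimal sketch (assuming a PyTorch setup) of (i) a monotonic mixing network that combines per-agent values into a single global value, and (ii) a DICE-style objective that regularizes the learned stationary distribution toward the dataset distribution. The class and function names, the hypernetwork architecture, and the chi-square f-divergence are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: names, architecture, and the chi-square
# f-divergence are assumptions, not ComaDICE's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueMixer(nn.Module):
    """Combines per-agent values into one global value with non-negative
    (monotonic) mixing weights conditioned on the global state."""

    def __init__(self, n_agents: int, state_dim: int):
        super().__init__()
        self.hyper_w = nn.Linear(state_dim, n_agents)  # state-conditioned weights
        self.hyper_b = nn.Linear(state_dim, 1)         # state-conditioned bias

    def forward(self, agent_values: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_values: (batch, n_agents), state: (batch, state_dim)
        w = torch.abs(self.hyper_w(state))             # abs() enforces monotonic mixing
        b = self.hyper_b(state)
        return (w * agent_values).sum(dim=-1, keepdim=True) + b


def dice_style_loss(nu_init, nu, nu_next, reward, done,
                    alpha: float = 1.0, gamma: float = 0.99):
    """DICE-style objective over dataset transitions (illustrative).

    nu_init     : mixed values at initial states, shape (batch, 1)
    nu, nu_next : mixed values at sampled states s and successors s'
    The temporal residual e = r + gamma * nu(s') - nu(s) stands in for the
    occupancy ratio; applying the convex conjugate of a chi-square
    divergence (assumed here) keeps the learned stationary distribution
    close to the dataset's, with alpha controlling regularization strength.
    """
    e = reward + gamma * (1.0 - done) * nu_next - nu
    # Conjugate f*(y) = relu(y + 1)^2 / 2 - 1/2 for the chi-square divergence.
    f_conj = F.relu(e / alpha + 1.0).pow(2) / 2.0 - 0.5
    return (1.0 - gamma) * nu_init.mean() + alpha * f_conj.mean()


# Minimal usage with random tensors (2 agents, toy dimensions).
if __name__ == "__main__":
    batch, n_agents, state_dim = 32, 2, 8
    mixer = ValueMixer(n_agents, state_dim)
    agent_v = torch.randn(batch, n_agents)
    state = torch.randn(batch, state_dim)
    nu = mixer(agent_v, state)
    loss = dice_style_loss(nu_init=nu, nu=nu,
                           nu_next=torch.randn(batch, 1),
                           reward=torch.randn(batch, 1),
                           done=torch.zeros(batch, 1))
    loss.backward()
```

The mixing step is what keeps the multi-agent problem manageable: each agent only needs to produce its own local value, and the state-conditioned (non-negative) weights combine them so that improving an individual agent's value never decreases the global one.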