Can multi-agent RL fine-tune LLMs better than PPO?
Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning
October 10, 2024
https://arxiv.org/pdf/2410.06101

This research introduces CORY, a novel method for fine-tuning Large Language Models (LLMs) by framing the process as a sequential cooperative multi-agent reinforcement learning problem.
Instead of training a single LLM, CORY duplicates the LLM into a "pioneer" and an "observer" that learn collaboratively. The key mechanisms (sketched in the code below) are:
- Knowledge Transfer: The observer learns from both the user query and the pioneer's response, helping it better align with the task reward while staying close to the original LLM's output distribution and avoiding distribution collapse.
- Role Exchange: The pioneer and observer periodically switch roles, preventing the observer from becoming overly reliant on the pioneer's output and ensuring both LLMs can operate independently.
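The minimal Python sketch below illustrates how these two mechanisms fit together in a training loop. The `Agent` class, `task_reward` function, shared reward, and `swap_every` interval are illustrative placeholders (assumptions), not the paper's implementation; in CORY, both agents are copies of the same LLM updated with an RL algorithm such as PPO.

```python
import random

class Agent:
    """Toy stand-in for an LLM policy; the real method fine-tunes two
    copies of the same LLM with an RL algorithm such as PPO."""
    def __init__(self, name):
        self.name = name

    def generate(self, prompt):
        # Placeholder generation; a real agent samples a response
        # from the LLM conditioned on `prompt`.
        return f"<{self.name} response to: {prompt}>"

    def update(self, prompt, response, reward):
        # Placeholder policy update; a real agent would take an RL step
        # on (prompt, response) weighted by `reward`.
        pass

def task_reward(query, response):
    # Placeholder task reward model (assumption, not from the paper).
    return random.random()

def cory_step(pioneer, observer, query):
    # Pioneer answers the query directly.
    y1 = pioneer.generate(query)
    # Knowledge transfer: the observer sees both the query and the
    # pioneer's response.
    y2 = observer.generate(query + "\n" + y1)
    # Both agents are rewarded on the task; sharing the summed reward
    # is an assumption of this sketch.
    r = task_reward(query, y1) + task_reward(query, y2)
    pioneer.update(query, y1, r)
    observer.update(query + "\n" + y1, y2, r)

def train(queries, swap_every=100):
    pioneer, observer = Agent("A"), Agent("B")
    for step, query in enumerate(queries):
        cory_step(pioneer, observer, query)
        # Role exchange: periodically swap pioneer and observer so
        # neither copy becomes dependent on the other's output.
        if (step + 1) % swap_every == 0:
            pioneer, observer = observer, pioneer

train([f"query {i}" for i in range(300)])
```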
Experiments show that, compared to single-agent PPO fine-tuning, CORY achieves comparable or better task performance with more stable training, greater robustness to distribution collapse, and a better trade-off between task reward and KL divergence.
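For context, the single-agent RL fine-tuning objective that PPO-based methods optimize is typically written as follows (standard RLHF formulation, stated here as background rather than taken from the paper):

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r(x, y) \big]
\; - \; \beta \,
\mathrm{KL}\!\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]
```

Here r is the task reward, pi_ref is the original (pre-fine-tuning) LLM, and beta weights the KL penalty; the reported reward-vs-KL balance measures how well each method maximizes the first term without letting the second term blow up.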