How can I align multiple LLMs in a complex system?
Aligning Compound AI Systems via System-level DPO
This paper proposes System-level Direct Preference Optimization (SysDPO), a method for aligning the outputs of compound AI systems with human preferences. It addresses the challenge of aligning systems whose components (e.g., LLMs and image generators) interact in non-differentiable ways, which makes standard alignment techniques such as DPO difficult to apply directly.
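For reference, standard DPO fine-tunes a single policy $\pi_\theta$ against a frozen reference $\pi_{\text{ref}}$ on preference pairs, where $y_w$ is the preferred and $y_l$ the rejected response to prompt $x$:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

The loss assumes a single tractable likelihood $\pi_\theta(y \mid x)$; as described below, SysDPO's move is to replace it with the likelihood of an entire system trajectory.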
SysDPO models the compound system as a directed acyclic graph (DAG) whose edges capture the flow of information between components. The DAG structure lets the system's output probability be factorized into per-component likelihoods, which are then optimized jointly with a modified DPO loss, enabling end-to-end training that aligns all components with system-level preferences. The researchers demonstrate SysDPO by jointly aligning an LLM and a diffusion model on a multi-image generation task, showing improved coherence and adherence to user instructions. The key takeaway for LLM-based multi-agent systems is that SysDPO offers a way to directly optimize for desired system behavior even when component interactions are non-differentiable; a sketch of the idea follows.
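Below is a minimal PyTorch sketch of this idea, not the paper's actual implementation. It assumes the system's log-likelihood factorizes into a sum of per-component log-probabilities along the DAG, and that each component exposes a hypothetical `log_prob` method; the `sysdpo_loss` name and all interfaces are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def system_logprob(steps):
    """Log-likelihood of one full system trajectory.

    `steps` is a list of (component, inputs, output) triples in
    topological order over the DAG. `component.log_prob(output, inputs)`
    is a hypothetical interface returning that component's log-probability
    of producing `output` given `inputs`. Because the DAG factorizes the
    joint distribution, the system log-prob is the sum of component terms.
    """
    return sum(c.log_prob(out, inp) for c, inp, out in steps)

def sysdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss applied to system-level log-likelihoods.

    logp_w / logp_l: trainable system's log-probs of the preferred /
    rejected trajectories; ref_logp_* come from a frozen reference copy.
    Gradients flow into each component only through its own likelihood
    term, never across the non-differentiable hand-offs between
    components (e.g., sampled text fed into a diffusion model).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with precomputed log-probs standing in for real components:
logp_w = torch.tensor([-12.3, -9.8], requires_grad=True)
logp_l = torch.tensor([-11.1, -9.5], requires_grad=True)
ref_w = torch.tensor([-12.0, -10.0])
ref_l = torch.tensor([-11.0, -9.7])
loss = sysdpo_loss(logp_w, logp_l, ref_w, ref_l)
loss.backward()
```

The design point this illustrates: because the preference signal enters only through trajectory log-likelihoods, each component can be updated through its own term no matter how the components are wired together.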