How can we fairly serve diverse LLMs?
ENSURING FAIR LLM SERVING AMID DIVERSE APPLICATIONS
November 26, 2024
https://arxiv.org/pdf/2411.15997

This paper introduces FAIRSERVE, a system for ensuring fair access to large language models (LLMs) in multi-tenant environments where diverse applications with varying resource needs coexist. It addresses the problem of some users or applications monopolizing LLM resources, making the service slow or unavailable for others.
Key points for LLM-based multi-agent systems:
- Multi-agent LLM interactions: FAIRSERVE explicitly accounts for the fact that modern LLM applications often chain multiple interconnected LLM calls (agents) into an "interaction graph" to generate a single user response. Existing fairness approaches typically don't account for this.
- Interaction-aware throttling: Instead of throttling individual LLM requests, FAIRSERVE throttles at the interaction level and only during system overload, minimizing wasted compute and incomplete user responses common with standard rate-limiting.
- Application-specific weights: FAIRSERVE uses a weighted service counter that considers application-specific characteristics (like expected input/output token lengths) to calculate the service received by users. This ensures fairness by accounting for the diverse resource demands of different applications.
- Reduced queueing delays: By prioritizing users in the middle of a multi-agent interaction, FAIRSERVE significantly reduces queueing delays and improves overall user experience.
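The weighted service counter described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the per-application `(input_weight, output_weight)` pairs, and the `least_served` tie-breaking rule are all assumptions chosen to make the idea concrete.

```python
from collections import defaultdict

class WeightedServiceCounter:
    """Sketch of a weighted service counter (hypothetical API).

    Service charged per request = w_in * input_tokens + w_out * output_tokens,
    where (w_in, w_out) are application-specific weights reflecting that
    different applications have very different expected token lengths.
    """

    def __init__(self, app_weights):
        # app_weights: {app_id: (input_weight, output_weight)}
        self.app_weights = app_weights
        # user_id -> accumulated weighted service received so far
        self.service = defaultdict(float)

    def record(self, user_id, app_id, input_tokens, output_tokens):
        w_in, w_out = self.app_weights[app_id]
        self.service[user_id] += w_in * input_tokens + w_out * output_tokens

    def least_served(self, user_ids):
        # Under overload, a fair scheduler would serve the user who has
        # received the least weighted service so far.
        return min(user_ids, key=lambda u: self.service[u])


# Example: a chat app with long outputs vs. a summarizer with long inputs.
counter = WeightedServiceCounter({"chat": (1.0, 2.0), "summarize": (0.5, 1.0)})
counter.record("u1", "chat", input_tokens=100, output_tokens=50)       # 200.0
counter.record("u2", "summarize", input_tokens=1000, output_tokens=100)  # 600.0
```

Here `least_served(["u1", "u2"])` would pick `u1`, because weighting by expected token profile prevents the raw token volume of the summarization app from being compared one-to-one against chat traffic.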
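The interaction-aware throttling and mid-interaction prioritization points can likewise be sketched together. This is an assumed toy scheduler, not FAIRSERVE's actual design: the class, the fixed `overload_threshold`, and the two-level priority are illustrative simplifications of the idea that only *new* interactions are throttled under overload, while requests belonging to an already-started interaction are admitted and served first.

```python
import heapq
import itertools

class InteractionScheduler:
    """Toy sketch: throttle at the interaction level, only under overload."""

    def __init__(self, overload_threshold):
        self.overload_threshold = overload_threshold  # queued requests before overload
        self.queue = []                    # min-heap of (priority, seq, iid, request)
        self.seq = itertools.count()       # FIFO tie-breaker within a priority level
        self.in_flight = set()             # interactions that have already started
        self.load = 0                      # number of queued requests

    def submit(self, interaction_id, request):
        mid_interaction = interaction_id in self.in_flight
        # Only brand-new interactions are rejected, and only during overload;
        # this avoids wasting the compute already spent on partial interactions.
        if not mid_interaction and self.load >= self.overload_threshold:
            return False
        priority = 0 if mid_interaction else 1  # mid-interaction requests go first
        heapq.heappush(self.queue, (priority, next(self.seq), interaction_id, request))
        self.in_flight.add(interaction_id)
        self.load += 1
        return True

    def next_request(self):
        if not self.queue:
            return None
        _, _, iid, req = heapq.heappop(self.queue)
        self.load -= 1
        return iid, req
```

Serving mid-interaction requests ahead of fresh arrivals is what cuts queueing delay for users already partway through a multi-agent response, since each of their follow-up agent calls skips past newly queued interactions.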