How can I route LLM requests efficiently with continuous learning?
Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models
This paper introduces Real-time Adapting Routing (RAR), a method for optimizing the use of multiple large language models (LLMs) of varying sizes and computational costs. RAR reduces reliance on larger, more expensive LLMs by having the larger model generate "guides" (reasoning steps) that the smaller LLM accumulates and applies through in-context learning, with no parameter updates. Over time, the smaller LLM can handle increasingly complex tasks, improving efficiency without significantly sacrificing performance. Key points for LLM-based multi-agent systems include the use of a layered architecture, continuous learning through in-context guides, dynamic routing decisions based on real-time performance, and the potential for intra- and inter-domain generalization of the learned guides.
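To make the routing loop concrete, here is a minimal sketch of the idea under stated assumptions: `small_llm`, `large_llm`, and `is_adequate` are hypothetical callables standing in for whatever models and quality judge you use, and the per-domain guide store is a plain in-memory dict. This is not the paper's implementation, only an illustration of routing with accumulated in-context guides.

```python
class RARStyleRouter:
    """Sketch of RAR-style routing: try the cheap model with learned guides,
    escalate to the expensive model when the answer is judged inadequate."""

    def __init__(self, small_llm, large_llm, is_adequate):
        self.small_llm = small_llm      # cheap model: (prompt) -> answer
        self.large_llm = large_llm      # expensive model: (query) -> (answer, guide)
        self.is_adequate = is_adequate  # judge: (query, answer) -> bool
        self.guides = {}                # domain -> list of reasoning guides

    def route(self, query: str, domain: str) -> str:
        # 1. Try the small model first, prepending any guides already learned
        #    for this domain (in-context learning; no weight updates).
        context = "\n".join(self.guides.get(domain, []))
        prompt = f"{context}\n\n{query}" if context else query
        answer = self.small_llm(prompt)
        if self.is_adequate(query, answer):
            return answer

        # 2. Fall back to the large model and ask it to also produce a reusable
        #    reasoning guide, which is stored for future queries in this domain.
        answer, guide = self.large_llm(query)
        self.guides.setdefault(domain, []).append(guide)
        return answer
```

As more guides accumulate, a larger share of requests is resolved by the small model, which is the efficiency gain the paper targets; how guides are matched to incoming queries (here, a simple domain key) is a design choice left open in this sketch.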