Can LLMs collaboratively improve image captioning?
Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference
April 15, 2025
https://arxiv.org/pdf/2504.09620

This paper introduces the Metropolis-Hastings Captioning Game (MHCG), a method for fusing the knowledge of multiple vision-language models (VLMs). Inspired by emergent communication research, MHCG lets two VLMs act as agents, taking turns generating image captions and learning from each other's outputs via a decentralized Bayesian inference process. This allows VLMs trained on different datasets to share knowledge without directly accessing each other's training data, improving performance on cross-dataset image captioning tasks.
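To make the game structure concrete, here is a minimal sketch of one speaker-listener exchange in the spirit of the Metropolis-Hastings naming game that MHCG builds on: the listener accepts the speaker's proposed caption with probability min(1, ratio of its own likelihoods). This is the generic MH scheme, not a reproduction of the paper's exact acceptance ratio or fine-tuning procedure, and the names `mh_accept`, `captioning_game_round`, `GenerateFn`, and `ScoreFn` are placeholders rather than the authors' API.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

# Placeholder interfaces standing in for the two VLM agents; the paper
# uses trained vision-language models, not these type aliases.
Caption = str
GenerateFn = Callable[[str], Caption]      # image_id -> proposed caption
ScoreFn = Callable[[str, Caption], float]  # (image_id, caption) -> log-likelihood


def mh_accept(logp_proposed: float, logp_current: float) -> bool:
    """Metropolis-Hastings acceptance: accept with probability
    min(1, exp(logp_proposed - logp_current))."""
    return random.random() < math.exp(min(0.0, logp_proposed - logp_current))


def captioning_game_round(image_ids: List[str],
                          speaker_generate: GenerateFn,
                          listener_score: ScoreFn,
                          listener_captions: Dict[str, Caption]) -> List[Tuple[str, Caption]]:
    """One half-turn of the game: the speaker proposes a caption for each
    image, and the listener decides, using only its own likelihoods,
    whether to adopt the proposal in place of its current caption."""
    accepted = []
    for img in image_ids:
        proposal = speaker_generate(img)
        logp_new = listener_score(img, proposal)
        logp_old = listener_score(img, listener_captions[img])
        if mh_accept(logp_new, logp_old):
            listener_captions[img] = proposal
            accepted.append((img, proposal))
    # The listener would then fine-tune on the captions it accepted.
    return accepted
```

Note that neither agent ever sees the other's training data or parameters; knowledge flows only through proposed captions and the listener's own acceptance decisions, which is what makes the process decentralized.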
Key points for LLM-based multi-agent systems:
- MHCG enables knowledge fusion between distinct LLMs without the shared architectures or joint training that many existing LLM ensemble methods require.
- The decentralized, game-like interaction allows agents to learn from each other while mitigating catastrophic forgetting of their original knowledge (see the turn-taking sketch after this list).
- The Bayesian approach offers a more principled way to combine LLM outputs compared to ad-hoc methods like simple averaging.
- The concept of treating LLMs as agents in a communication game opens up new possibilities for multi-agent LLM application development.
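As a rough illustration of the turn-taking structure, the hypothetical driver below alternates the speaker and listener roles each turn, reusing `captioning_game_round` from the sketch above. `agent_a`, `agent_b`, and their `generate`/`score`/`fine_tune` methods are assumptions made for illustration, not the paper's interface.

```python
def play_mhcg(agent_a, agent_b, image_ids, captions_a, captions_b, turns=10):
    """Hypothetical MHCG driver: agents alternate speaker/listener roles."""
    for t in range(turns):
        # Swap roles each turn so both agents get to propose and to listen.
        if t % 2 == 0:
            speaker, listener, listener_caps = agent_a, agent_b, captions_b
        else:
            speaker, listener, listener_caps = agent_b, agent_a, captions_a
        accepted = captioning_game_round(
            image_ids, speaker.generate, listener.score, listener_caps)
        # Fine-tuning only on captions the listener itself accepted keeps
        # updates anchored to its own likelihood, which is how the game
        # aims to limit catastrophic forgetting.
        listener.fine_tune(accepted)
```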