How can LLMs improve medical diagnosis using multi-agent dialogue?
3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark
April 22, 2025
https://arxiv.org/pdf/2504.13861
This paper introduces 3MDBench, a new framework for testing how well LLMs perform as AI doctors in simulated telemedicine appointments. It uses multiple AI agents: a doctor LLM, a patient LLM with a simulated personality (sanguine, choleric, melancholic, or phlegmatic), and a judge LLM that scores the doctor's performance.
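The three-agent setup described above can be sketched as a simple turn-taking loop. This is a minimal illustration, not the paper's implementation: the names `run_consultation`, `judge`, `Consultation`, the `DIAGNOSIS:` prefix convention, and the stub agents are all assumptions made for the example, and a real setup would replace the stubs with LLM calls and a richer judging rubric.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# The four patient temperaments named in the summary.
TEMPERAMENTS = ("sanguine", "choleric", "melancholic", "phlegmatic")

# Hypothetical agent interface: a callable that reads the dialogue so far
# (a list of (role, text) pairs) and returns the next utterance.
Agent = Callable[[List[Tuple[str, str]]], str]


@dataclass
class Consultation:
    temperament: str
    turns: List[Tuple[str, str]] = field(default_factory=list)
    diagnosis: str = ""


def run_consultation(doctor: Agent, patient: Agent, temperament: str,
                     max_turns: int = 5) -> Consultation:
    """Alternate doctor and patient turns until the doctor gives a diagnosis."""
    if temperament not in TEMPERAMENTS:
        raise ValueError(f"unknown temperament: {temperament}")
    session = Consultation(temperament)
    for _ in range(max_turns):
        doctor_msg = doctor(session.turns)
        session.turns.append(("doctor", doctor_msg))
        # Assumed convention: the doctor ends the dialogue with "DIAGNOSIS: ...".
        if doctor_msg.startswith("DIAGNOSIS:"):
            session.diagnosis = doctor_msg[len("DIAGNOSIS:"):].strip()
            break
        session.turns.append(("patient", patient(session.turns)))
    return session


def judge(session: Consultation, true_condition: str) -> Dict[str, object]:
    """Toy judge: checks the final diagnosis and counts the doctor's turns."""
    return {
        "correct": session.diagnosis.lower() == true_condition.lower(),
        "doctor_turns": sum(1 for role, _ in session.turns if role == "doctor"),
    }


# Stub agents standing in for LLM calls.
def stub_doctor(turns: List[Tuple[str, str]]) -> str:
    if not turns:
        return "What symptoms are you experiencing?"
    return "DIAGNOSIS: eczema"


def stub_patient(turns: List[Tuple[str, str]]) -> str:
    return "I have an itchy red rash on my arm."


session = run_consultation(stub_doctor, stub_patient, "phlegmatic")
report = judge(session, "Eczema")
print(report)
```

Keeping each agent behind the same callable interface makes it straightforward to swap in different doctor models or patient temperaments while leaving the loop and the judge unchanged.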
Key points for LLM-based multi-agent systems:
- Personality matters: Patient personality affects dialogue length and can influence other aspects of the doctor LLM's performance, even though diagnostic accuracy remains stable.
- Dialogue and visuals help: Allowing the doctor LLM to have a conversation and see an image of the patient's issue improves diagnostic accuracy.
- Rationale isn't always key: Forcing the doctor LLM to explain its reasoning doesn't improve performance in this medical setting.
- Small expert models boost performance: Giving the doctor LLM access to predictions from a smaller, specialized image-recognition model significantly improves its diagnoses.
- Realism is crucial: Simulating realistic patient behavior and giving the doctor LLM multimodal information is key for effective evaluation of medical LLMs.
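To make the points about visuals and small expert models concrete, here is one hedged way the doctor LLM's input could be assembled from the patient's complaint, an image reference, and the top predictions of a specialized image model. The function `build_doctor_context`, its parameters, the text layout, and the example labels and scores are hypothetical; the paper's actual prompt format is not given in this summary.

```python
from typing import Dict, Optional


def build_doctor_context(complaint: str,
                         image_path: Optional[str] = None,
                         expert_predictions: Optional[Dict[str, float]] = None,
                         top_k: int = 3) -> str:
    """Combine the available inputs into a single text context for the doctor LLM."""
    parts = [f"Patient complaint: {complaint}"]
    if image_path is not None:
        # In a real multimodal call the image itself would be attached;
        # here only a reference to it is recorded.
        parts.append(f"Attached image: {image_path}")
    if expert_predictions:
        # Keep only the top_k highest-scoring labels from the expert model.
        ranked = sorted(expert_predictions.items(),
                        key=lambda kv: kv[1], reverse=True)[:top_k]
        hints = ", ".join(f"{label} ({score:.2f})" for label, score in ranked)
        parts.append(f"Image-model suggestions (may be wrong): {hints}")
    return "\n".join(parts)


# Illustrative values only; these are not results from the paper.
example_scores = {"eczema": 0.7, "psoriasis": 0.2, "ringworm": 0.05, "acne": 0.05}
context = build_doctor_context("itchy rash on forearm", "rash.jpg",
                               example_scores, top_k=2)
print(context)
```

Passing the expert model's output as a hint, rather than as a final answer, leaves the doctor LLM free to weigh it against what it learns in the dialogue.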