Can prompt attacks break multi-agent LLMs?
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
April 2, 2025
https://arxiv.org/pdf/2504.00218

This paper examines vulnerabilities in multi-agent LLM systems: carefully crafted prompt fragments ("attacks") propagated through a network of LLM agents can bypass safety measures and induce a target LLM to generate harmful content (a "jailbreak"). It introduces a method for optimizing these attacks under network constraints such as bandwidth and latency, and for making them robust to the order in which the fragments arrive at the target. Experiments show the optimized attacks are significantly more effective than simpler baselines, highlighting a critical weakness in current multi-agent LLM safety mechanisms. The study focuses on text-based agents and assumes partial knowledge of the network topology.
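To make the constraint structure concrete, here is a minimal, hypothetical sketch (not the paper's actual optimizer or scoring function) of searching for a set of prompt fragments that respects per-message bandwidth and hop-count (latency) limits while optimizing a worst-case score over arrival orders. The names (FRAGMENT_POOL, score_attack, optimize) and the toy order-sensitive scoring heuristic are assumptions introduced purely for illustration; the real method would query the target LLM to measure jailbreak success.

```python
"""Illustrative sketch of permutation-robust, constraint-aware prompt-attack
optimization in a multi-agent setting.  All names and heuristics here are
hypothetical placeholders, not the paper's implementation."""
import itertools
import random

# Hypothetical pool of abstract candidate fragments (placeholders only).
FRAGMENT_POOL = [f"placeholder fragment {i}" for i in range(10)]

MAX_TOKENS_PER_MSG = 8   # bandwidth: max tokens a single inter-agent message may carry
MAX_HOPS = 3             # latency: fragments must reach the target within this many hops


def score_attack(ordered_fragments):
    """Placeholder for querying the target LLM with the reassembled prompt and
    measuring jailbreak success.  This toy heuristic is deliberately
    order-sensitive so the permutation-robust objective below is non-trivial."""
    return sum(1 for a, b in zip(ordered_fragments, ordered_fragments[1:]) if a < b)


def permutation_robust_score(fragments):
    """Worst-case score over all arrival orders, so the optimized attack does
    not depend on which fragment reaches the target first."""
    return min(score_attack(list(p)) for p in itertools.permutations(fragments))


def feasible(fragments):
    """Enforce the bandwidth limit on each fragment and the hop budget overall."""
    return (all(len(f.split()) <= MAX_TOKENS_PER_MSG for f in fragments)
            and len(fragments) <= MAX_HOPS)


def optimize(pool, n_fragments=3, iters=200, seed=0):
    """Simple random-restart search over fragment subsets.  The paper uses a
    more sophisticated optimizer; this only illustrates the objective and
    constraints."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(iters):
        candidate = rng.sample(pool, n_fragments)
        if not feasible(candidate):
            continue
        s = permutation_robust_score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score


if __name__ == "__main__":
    fragments, score = optimize(FRAGMENT_POOL)
    print(fragments, score)
```

The key design point this sketch tries to convey is the worst-case (min over permutations) objective: because fragments traverse different paths with different latencies, the attacker cannot control arrival order, so the optimization must succeed under every ordering rather than a single assumed one.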