How transferable are adversarial attacks on shared backbones?
With Great Backbones Comes Great Adversarial Transferability
This paper investigates the vulnerability of publicly available pre-trained image classification models (such as ResNet and ViT) to adversarial attacks, particularly once they are fine-tuned for downstream tasks. It introduces a "grey-box" attack scenario in which the attacker has partial knowledge of the target model's training pipeline, including the backbone architecture and possibly other meta-information such as the fine-tuning dataset or technique. A novel "backbone attack" is also presented, which leverages only the pre-trained feature extractor to generate adversarial examples. Results show that these grey-box attacks, even the simple backbone attack, often surpass strong black-box methods and approach white-box attack effectiveness, raising concerns about the security risks of sharing pre-trained models.
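To make the backbone attack idea concrete, below is a minimal sketch of a feature-space attack that uses only a shared pre-trained backbone, with no access to the victim's fine-tuned classification head. The PGD-style optimization, the cosine-distance feature loss, and all hyperparameters here are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a "backbone attack": perturb the input so its backbone features
# drift away from the clean features, hoping the perturbation transfers to
# any downstream model fine-tuned on top of the same backbone.
# Assumptions: PGD-style updates, cosine-similarity loss, ImageNet ResNet-50.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()   # keep only the feature extractor
backbone.eval()

# Standard ImageNet normalization; eps/alpha are defined in [0, 1] pixel space.
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def features(x):
    return backbone((x - mean) / std)

def backbone_attack(x, eps=8 / 255, alpha=2 / 255, steps=10):
    """Return an adversarial version of x crafted with the backbone alone."""
    with torch.no_grad():
        clean_feats = features(x)
    # Random start inside the eps-ball, as in standard PGD.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        adv_feats = features(x_adv)
        # Ascend on the negative similarity, i.e. push adversarial features
        # away from the clean features.
        loss = -F.cosine_similarity(adv_feats, clean_feats).mean()
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```

The adversarial image is then fed to the unseen fine-tuned victim model; transferability comes from the victim relying on the same backbone features that the attack distorts.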
While not explicitly about multi-agent systems, this research is relevant to LLM-based multi-agent systems because it highlights vulnerabilities of foundation models (such as pre-trained vision or language models) that could be exploited by malicious agents. Specifically, it demonstrates how even limited knowledge of an agent's underlying model architecture can be leveraged to craft adversarial inputs that manipulate its behavior. This underscores the need for careful consideration of security risks and mitigation strategies, particularly when deploying agents built on shared or publicly available LLMs. Just as pre-trained vision backbones can be exploited, LLMs acting as agents could be vulnerable to specially crafted prompts or inputs designed to trigger unintended actions.