Can LLMs play StarCraft II effectively using vision and language?
VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method
March 10, 2025
https://arxiv.org/pdf/2503.05383

This paper introduces VLM-Attention, a new approach for creating StarCraft II AI agents that behave more like humans. It uses vision and language models (VLMs) so agents perceive the game through images and text descriptions, as human players do, rather than through the simplified data representations used by traditional StarCraft II AI.
Key points for LLM-based multi-agent systems:
- Multimodal Input: Agents receive RGB images and natural language descriptions of the game state, closer to human perception.
- VLM-based Architecture: The agent architecture uses VLMs for strategic unit targeting (an attention mechanism), tactical decision-making (knowledge integration via retrieval-augmented generation), and dynamic role assignment for multi-agent coordination.
- Human-like Play: The system aims to replicate human cognitive processes in gameplay, making agent behavior more understandable and potentially enabling more effective human-AI collaboration in the future.
- Emergent Capabilities: Without specific training, the agents demonstrated strategic unit prioritization, multi-unit synergy, adaptive tactical responses, and context-aware decision making.
- Limitations: Current weaknesses include spatial understanding and real-time control in complex scenarios.
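To make the architecture in the bullets above concrete, here is a minimal sketch of one decision step: retrieve tactical knowledge matching the text description of the game state (standing in for the paper's retrieval-augmented generation step), query a VLM for a priority target, and assign roles to friendly units. All names (`GameState`, `retrieve_tactics`, `assign_roles`, the stub VLM, and the toy knowledge base) are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class GameState:
    rgb_frame: bytes          # raw screen capture (stand-in for an image array)
    text_description: str     # natural-language summary of the visible battlefield

# Toy knowledge base standing in for the paper's retrieved tactical knowledge
KNOWLEDGE_BASE = {
    "marine": "Focus fire and kite melee units; retreat units on low health.",
    "zealot": "Close distance quickly; avoid prolonged ranged exchanges.",
}

def retrieve_tactics(description: str) -> list[str]:
    """Naive keyword retrieval standing in for a real RAG pipeline."""
    return [tip for unit, tip in KNOWLEDGE_BASE.items()
            if unit in description.lower()]

def assign_roles(unit_ids, priority_target):
    """Dynamic role assignment: alternate attackers and flankers on one target."""
    return {
        uid: ("attack", priority_target) if i % 2 == 0 else ("flank", priority_target)
        for i, uid in enumerate(unit_ids)
    }

def decide(state, unit_ids, vlm_query):
    """One decision step: retrieve tactics, ask the VLM for a target, assign roles."""
    tactics = retrieve_tactics(state.text_description)
    target = vlm_query(state, tactics)   # the VLM picks the priority target
    return assign_roles(unit_ids, target)

# Stub VLM for testing: prioritizes the first enemy named in the description
def stub_vlm(state, tactics):
    return state.text_description.split()[0]

state = GameState(b"", "Zealot approaching; two Marines on low health")
orders = decide(state, ["m1", "m2", "m3"], stub_vlm)
```

In the real system a multimodal model would consume both the image and the text; the stub here only shows how the retrieval, targeting, and role-assignment pieces compose.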