How can I build truly versatile AI agents?
Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms
November 19, 2024
https://arxiv.org/pdf/2411.10943

This paper surveys Generalist Virtual Agents (GVAs), autonomous agents designed to operate across diverse digital platforms and environments, fulfilling user needs by executing a variety of tasks. It traces GVA development from early intelligent assistants to those incorporating large language models (LLMs).
Key points for LLM-based multi-agent systems include:
- GVA Architecture: The survey characterizes a GVA by its environment (web, application, or operating system), task (command, query, or dialogue), observation space (CLI, DOM, or screen), and action space (keyboard, mouse, or touchscreen); see the illustrative sketch after this list.
- Model Types: GVAs utilize various model architectures including retriever-based, LLM-based, Multimodal LLM (MLLM)-based, and Vision-Language-Action (VLA)-based agents. MLLMs are highlighted as leading the evolution of agents due to their ability to handle diverse data modalities.
- Training Strategies: Key training methods include adaptation (prompting, feedback, memory), fine-tuning, reinforcement learning, and cooperation/competition strategies.
- Evaluation Methods: Evaluation focuses on overall task success, detailed step-wise analysis of individual actions, human assessment, and MLLM-based automatic evaluation; a minimal sketch of the first two measures also follows the list.
- Limitations: Current GVAs face challenges including unrealistic training data and environments, limited transferability between platforms, difficulty with long-sequence decision-making, and security concerns.
- Future Directions: The paper advocates for a shift from individual agents to systematic multi-agent frameworks integrated into operating systems. It also highlights the potential of embodied intelligence, where agents interact with the physical world through sensors and actuators.
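To make the architecture bullet concrete, here is a minimal sketch of how a GVA's environment, task, observation space, and action space might be expressed as interfaces. The class and function names (`Observation`, `Action`, `VirtualEnvironment`, `GVAgent`, `run_episode`) are illustrative assumptions for this post, not an API from the surveyed work.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

# Illustrative types only; names are assumptions, not taken from the survey.

@dataclass
class Observation:
    """What the agent perceives: CLI output, a DOM snapshot, or a screenshot."""
    cli_text: Optional[str] = None
    dom_html: Optional[str] = None
    screenshot_png: Optional[bytes] = None

@dataclass
class Action:
    """What the agent emits: keyboard, mouse, or touchscreen primitives."""
    kind: str                       # e.g. "type", "click", "tap", "scroll"
    target: Optional[str] = None    # e.g. a DOM selector or screen coordinate
    text: Optional[str] = None      # payload for "type" actions

class VirtualEnvironment(ABC):
    """A web page, application, or operating system the agent acts in."""

    @abstractmethod
    def reset(self, task: str) -> Observation:
        """Start a task (command, query, or dialogue turn); return the first observation."""

    @abstractmethod
    def step(self, action: Action) -> tuple[Observation, bool]:
        """Apply an action; return the next observation and whether the task is done."""

class GVAgent(ABC):
    """Policy mapping the task and current observation to the next action (e.g. an MLLM)."""

    @abstractmethod
    def act(self, task: str, observation: Observation) -> Action:
        ...

def run_episode(agent: GVAgent, env: VirtualEnvironment, task: str, max_steps: int = 50) -> bool:
    """Roll out one task; return True if the environment reports completion."""
    obs = env.reset(task)
    for _ in range(max_steps):
        action = agent.act(task, obs)
        obs, done = env.step(action)
        if done:
            return True
    return False
```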
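For the evaluation bullet, a hedged sketch of the two automatic measures mentioned above: overall task success rate and step-wise comparison against a reference trajectory. The function names and the serialized action format (e.g. `"click:#submit"`) are assumptions for illustration; real benchmarks use richer matching criteria.

```python
from typing import Sequence

def success_rate(results: Sequence[bool]) -> float:
    """Overall success: fraction of tasks the agent completed end to end."""
    return sum(results) / len(results) if results else 0.0

def stepwise_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Step-wise analysis: fraction of reference steps matched by the agent's actions.

    Both sequences are assumed to hold serialized actions such as "click:#submit"
    or "type:hello"; this exact-match rule is a simplification.
    """
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Example: 3 of 4 tasks succeed overall; 2 of 3 reference steps are matched on one task.
print(success_rate([True, True, False, True]))                 # 0.75
print(stepwise_accuracy(["click:#a", "type:x", "click:#b"],
                        ["click:#a", "type:y", "click:#b"]))   # ~0.67
```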