How to best evaluate LLM-powered agents?
Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset
This paper introduces a framework for building agentic systems that dynamically decompose complex tasks, integrate external tools, and adapt to changing conditions. It uses LLMs to process multi-hop queries, generate task graphs, select tools, and execute tasks, potentially in parallel. To evaluate these systems, the authors introduce new metrics (Node F1 Score, Structural Similarity Index (SSI), and Tool F1 Score) along with a specialized dataset based on AsyncHow. Results indicate that structural metrics such as SSI are crucial for sequential tasks, while tool-related metrics dominate in parallel tasks. The research aims to improve the adaptability and reliability of agentic systems for process automation, especially for complex workflows.
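
To make the two F1-style metrics more concrete, here is a minimal sketch of how a Node F1 or Tool F1 score could be computed. The summary above does not give the paper's exact definitions, so this assumes the simplest reading: the predicted and expected node labels (or tool names) are compared as sets, and F1 is the harmonic mean of precision and recall. The function name set_f1 and all node and tool names are hypothetical, chosen only for illustration. SSI is left out because its formula cannot be recovered from the summary.

```python
from typing import Hashable, Iterable


def set_f1(predicted: Iterable[Hashable], expected: Iterable[Hashable]) -> float:
    """F1 between two collections, compared as sets (duplicates are ignored)."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0  # assumption: two empty sets count as a perfect match
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # share of predicted items that are correct
    recall = overlap / len(gold)     # share of expected items that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical task-graph nodes and tool choices for a single query.
expected_nodes = ["fetch_orders", "fetch_refunds", "join", "summarize"]
predicted_nodes = ["fetch_orders", "join", "summarize", "send_email"]
expected_tools = ["sql", "sql", "pandas", "llm"]
predicted_tools = ["sql", "pandas"]

node_f1 = set_f1(predicted_nodes, expected_nodes)
tool_f1 = set_f1(predicted_tools, expected_tools)
print(f"Node F1: {node_f1:.2f}, Tool F1: {tool_f1:.2f}")
```

Because the comparison is set-based, this sketch ignores both ordering and how often a tool is called. That is one reason a separate structural metric is needed to judge whether the task graph itself has the right shape.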