How can we reliably detect dangerous AI capabilities?
QUANTIFYING DETECTION RATES FOR DANGEROUS CAPABILITIES: A THEORETICAL MODEL OF DANGEROUS CAPABILITY EVALUATIONS
This paper proposes a mathematical model for quantifying the effectiveness of "dangerous capability testing" for AI systems, focusing on how well such tests estimate a lower bound on a system's dangerous capabilities. It analyzes how bias in these estimates arises and why detection of a critical danger threshold can lag behind the moment it is actually crossed. Key issues include the difficulty of constructing tests for higher-severity dangers, competitive pressures that lead to underinvestment in safety testing, and the possibility that AI systems strategically underperform or behave deceptively during evaluation. The paper argues that effective risk management requires balanced investment in both high-severity tests and tests close to the estimated frontier of AI capabilities. Although not explicitly focused on multi-agent systems, the core concepts of estimation bias, detection lag, and strategic behavior are directly relevant to LLM-based multi-agent systems, especially for emergent risks arising from agent interaction and coordination, and could be applied to analyze how effectively the dangerous capabilities of multi-agent systems are monitored as they scale.
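As a rough illustration of the dynamics described above (a toy sketch, not the paper's actual formulation), consider a simulation in which true capability grows over time while evaluations only confirm capability up to the hardest test passed. The names, test levels, growth rate, and miss probability below are all illustrative assumptions; the point is simply that a noisy, test-based lower bound produces both estimation bias and a lag in detecting a threshold crossing.

```python
import random

# Toy sketch (illustrative assumptions, not the paper's model): true capability
# grows over time; a finite battery of tests only confirms capability up to the
# hardest test passed, so evaluations yield a noisy lower bound that lags the truth.

random.seed(0)

DANGER_THRESHOLD = 8.0           # hypothetical critical capability level
TEST_LEVELS = [2, 4, 6, 8, 10]   # severities for which tests have been built
MISS_PROB = 0.5                  # chance a capable system still fails a test
                                 # (imperfect elicitation or strategic underperformance)

def true_capability(t: int) -> float:
    """Hypothetical capability trajectory: grows roughly linearly with time/scale."""
    return 0.5 * t

def evaluate(capability: float) -> float:
    """Highest test level passed: a noisy lower-bound estimate of capability."""
    passed = 0.0
    for level in TEST_LEVELS:
        if capability >= level and random.random() > MISS_PROB:
            passed = level
    return passed

crossed_at, detected_at, biases = None, None, []
for t in range(40):
    cap = true_capability(t)
    est = evaluate(cap)
    biases.append(cap - est)                     # estimation bias at step t
    if crossed_at is None and cap >= DANGER_THRESHOLD:
        crossed_at = t                           # true threshold crossing
    if detected_at is None and est >= DANGER_THRESHOLD:
        detected_at = t                          # first evaluation that detects it

print(f"mean bias (true - estimate): {sum(biases) / len(biases):.2f}")
if crossed_at is not None and detected_at is not None:
    print(f"threshold crossed at t={crossed_at}, detected at t={detected_at}, "
          f"lag = {detected_at - crossed_at} steps")
```

In this toy setup, adding test levels closer to the current capability frontier or lowering the miss probability shrinks both the average bias and the detection lag, which loosely mirrors the paper's argument for balanced investment in high-severity tests and tests near the estimated frontier.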