Can LLMs reliably search and reason like humans?
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
This paper introduces BLUR, a benchmark dataset for testing "tip-of-the-tongue" known-item retrieval by AI assistants. It focuses on complex, real-world scenarios that require multi-hop reasoning, tool use, and handling of uncertainty across multimodal (text, image, audio, video) and multilingual inputs.

Key points for LLM-based multi-agent systems:
- Current LLMs and agents struggle with these queries, particularly with tool selection and orchestration.
- Parametric knowledge alone is insufficient, especially for information not readily available online.
- Robust evaluation methodologies are crucial for measuring true progress in multi-agent AI capabilities, addressing issues such as data contamination and the evolving nature of online tools.