How can an AR agent proactively help users with tasks?
YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks
January 17, 2025
https://arxiv.org/pdf/2501.09355

This research introduces YETI (YET to Intervene), a framework for creating proactive AI assistants in augmented reality. Instead of only reacting to user requests, YETI anticipates user needs and intervenes proactively. It uses lightweight signals, such as changes in object counts and visual similarity between consecutive video frames, to decide when to intervene, making it suitable for resource-constrained AR devices.
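To make the trigger idea concrete, here is a minimal sketch of how a frame-similarity signal could gate a more expensive VLM call. This is an illustration under assumptions, not the paper's implementation: the SSIM-based scoring, the threshold value, and the `run_vlm_check` callback are all placeholders chosen for the example.

```python
# Sketch: fire the cheap frame-similarity signal first, and only escalate
# to the heavier VLM step when it triggers. Threshold and callback are
# assumptions, not values from the paper.

from skimage.metrics import structural_similarity as ssim

SIM_THRESHOLD = 0.9  # assumed value; would need tuning per device/camera

def scan_for_interventions(frames, run_vlm_check, threshold=SIM_THRESHOLD):
    """frames: iterable of grayscale uint8 numpy arrays (egocentric video)."""
    prev = None
    for t, frame in enumerate(frames):
        if prev is not None:
            score = ssim(prev, frame)   # visual similarity of consecutive frames
            if score >= threshold:      # scene stable: the user may have paused a step
                run_vlm_check(frame, t) # hand off to the VLM for a closer look
        prev = frame
```

The point of the two-stage design is that the per-frame similarity check is cheap enough to run continuously on an AR headset, while the VLM is invoked only at candidate moments.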
Key points for LLM-based multi-agent systems:
- Proactivity: YETI shifts the paradigm from reactive to proactive AI assistance, which is crucial in complex or safety-critical tasks where waiting for the user to ask for help may be too late.
- Lightweight Computation: YETI uses efficient algorithms to process video and detect intervention opportunities, enabling real-time operation on AR devices even with limited computational power.
- Multimodal Integration: While tested with a lightweight VLM (PaliGemma) for object counting (see the count-change sketch after this list), YETI’s framework is designed to incorporate multiple modalities (e.g., hand pose, eye gaze) to enhance contextual understanding and intervention accuracy.
- Open-Source VLM Use: The use of PaliGemma demonstrates the feasibility of leveraging open-source models for complex multi-agent applications.
- AR Focus: YETI's design explicitly addresses the challenges and opportunities of AR interfaces, including egocentric perspectives and real-time interaction requirements.
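As a companion to the object-counting bullet above, the sketch below shows how a count-change signal could flag candidate intervention points: per-frame object counts (e.g., from a VLM such as PaliGemma) are compared, and any change suggests a part was added or removed mid-task. The `count_objects` callable is a hypothetical stand-in for the VLM query, not an API from the paper.

```python
from typing import Callable, Iterable, Iterator, Tuple
import numpy as np

def count_change_signal(
    frames: Iterable[np.ndarray],
    count_objects: Callable[[np.ndarray], int],
) -> Iterator[Tuple[int, int, int]]:
    """Yield (frame_index, previous_count, new_count) whenever the
    per-frame object count changes, e.g. a part was added or removed."""
    prev_count = None
    for t, frame in enumerate(frames):
        count = count_objects(frame)  # hypothetical VLM-backed counter
        if prev_count is not None and count != prev_count:
            yield t, prev_count, count
        prev_count = count
```

In practice such a signal would be combined with the frame-similarity check sketched earlier, so that a single noisy count does not trigger an unnecessary intervention.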