Can VLMs improve robot behavior prediction iteratively?
TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models
This paper introduces TRACE (Tree-of-thought Reasoning And Counterfactual Exploration), a framework for predicting the behavior of other agents (e.g., robots, vehicles) in a shared environment, even from limited observations. A Vision-Language Model (VLM) generates multiple candidate future trajectories (hypotheses) for the target agent, while a "critic" component surfaces edge cases and unusual but valid maneuvers the VLM might otherwise miss. These counterfactuals are fed back to the VLM in an iterative loop, letting it refine its predictions at inference time without explicit retraining, so it becomes more robust at anticipating both common and uncommon behaviors. The approach is demonstrated in simulated autonomous driving scenarios and in real-world marine navigation with autonomous surface vehicles.
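To make the loop concrete, here is a minimal sketch of a TRACE-style generate-critique-refine cycle. All names here (`Hypothesis`, `propose_trajectories`, `find_missed_behaviors`, `trace_predict`) and the checklist-based critic are illustrative assumptions, not the paper's actual API; a real system would send the scene image and prompt to a VLM and use a more sophisticated critic.

```python
# Illustrative sketch of a TRACE-style refinement loop (not the paper's API).
from dataclasses import dataclass

@dataclass
class Hypothesis:
    maneuver: str       # e.g. "yield", "overtake_left"
    rationale: str      # the VLM's stated reasoning for this trajectory

def propose_trajectories(scene: str, feedback: list[str]) -> list[Hypothesis]:
    """Stand-in for the VLM call: prior critic feedback is folded into the
    prompt, which is how the loop improves without any retraining."""
    prompt = f"Scene: {scene}. Previously missed behaviors: {feedback}"
    # A real system would send `prompt` plus the scene image to a VLM here.
    return [Hypothesis("maintain_course", "agent is mid-lane at steady speed")]

def find_missed_behaviors(hyps: list[Hypothesis]) -> list[str]:
    """Stand-in for the critic: names valid maneuvers the VLM never proposed.
    A fixed checklist here; the paper's critic targets edge cases."""
    required = {"maintain_course", "hard_brake", "swerve_left"}
    proposed = {h.maneuver for h in hyps}
    return sorted(required - proposed)

def trace_predict(scene: str, max_rounds: int = 3) -> list[Hypothesis]:
    feedback: list[str] = []
    hyps: list[Hypothesis] = []
    for _ in range(max_rounds):
        hyps = propose_trajectories(scene, feedback)
        missed = find_missed_behaviors(hyps)
        if not missed:            # critic satisfied: the hypothesis set
            break                 # covers common and uncommon behaviors
        feedback.extend(missed)   # counterfactuals feed the next round
    return hyps
```

The key design point is that feedback accumulates in the prompt rather than in model weights, which is what lets the same frozen VLM improve from round to round.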
Key points for LLM-based multi-agent systems:
- Iterative Refinement: TRACE shows how iterative feedback, including counterfactual examples, can significantly improve the performance of LLMs in multi-agent scenarios.
- Edge Case Handling: The critic counteracts the model's tendency to fixate on common behaviors, pushing it to consider a wider range of valid maneuvers, which is crucial for robust interaction.
- Self-Improvement: The VLM within TRACE learns from experience during inference itself, becoming more effective at predicting agent behavior over time without needing retraining (a sketch of carrying critic feedback across episodes follows this list).
- Bridging Perception and Reasoning: The framework combines visual information with the domain knowledge and reasoning capabilities of the underlying model to anticipate agent actions.
- Practical Applicability: The successful implementation in both simulation and real-world robotic systems highlights the potential of this approach for developing more robust and adaptive multi-agent systems.
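As referenced in the self-improvement point above, here is a minimal sketch of one way critic feedback could persist across prediction episodes, so later queries start from earlier findings. The `FeedbackMemory` class and its integration are assumptions for illustration, not the paper's stated mechanism; it reuses the placeholder functions from the earlier sketch.

```python
# Illustrative cross-episode memory: inference-time improvement, no weight
# updates. Assumes propose_trajectories / find_missed_behaviors from above.
class FeedbackMemory:
    """Accumulates critic-surfaced behaviors across prediction episodes."""
    def __init__(self) -> None:
        self._missed: dict[str, int] = {}   # behavior -> times missed

    def record(self, behaviors: list[str]) -> None:
        for b in behaviors:
            self._missed[b] = self._missed.get(b, 0) + 1

    def top_reminders(self, k: int = 5) -> list[str]:
        """Most frequently missed behaviors, injected into future prompts."""
        ranked = sorted(self._missed.items(), key=lambda kv: -kv[1])
        return [b for b, _ in ranked[:k]]

memory = FeedbackMemory()
# Per episode:
#   hyps = propose_trajectories(scene, memory.top_reminders())
#   memory.record(find_missed_behaviors(hyps))
```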