How can mixed-quality human feedback improve MARL reward functions?
M³HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality
March 5, 2025
https://arxiv.org/pdf/2503.02077

This paper introduces M³HF (Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality), a method for training AI agents in collaborative tasks where defining clear rewards is difficult. It addresses this by incorporating human feedback at different stages of the training process to refine the agents' behavior.
Key points for LLM-based multi-agent systems:
- Human feedback parsing: LLMs interpret human feedback (which can vary in quality and detail) and translate it into structured rewards for the individual agents.
- Iterative refinement: Feedback is incorporated across multiple training "generations," allowing the system to progressively improve agent coordination based on human guidance.
- Reward function templates: LLMs fill predefined templates to create new reward functions from the parsed feedback, adding structure and efficiency to the process (see the first sketch after this list).
- Adaptive weighting: The system dynamically adjusts the importance of different reward components based on the observed improvement in agent performance, mitigating the impact of noisy or inaccurate feedback (see the second sketch below).
- VLM exploration: The paper also explores using Vision-Language Models (VLMs) as an alternative feedback source, but finds that current VLMs fall short of providing actionable suggestions.
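
A minimal sketch of the reward-template idea: an LLM would map free-form human feedback onto a predefined template and fill in its arguments, yielding a structured reward term added to the environment reward. All names here (`RewardComponent`, `parse_feedback`, the proximity template) are hypothetical illustrations, not the paper's actual implementation, and the LLM call is mocked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import numpy as np

@dataclass
class RewardComponent:
    """One templated reward term produced from a piece of human feedback."""
    name: str
    weight: float
    fn: Callable[[Dict[str, np.ndarray]], float]  # maps agent observations to a scalar

def parse_feedback(feedback: str) -> RewardComponent:
    """Stand-in for the LLM step: map free-form feedback to a template instance.
    A real system would prompt an LLM to pick a template ("approach", "avoid",
    "synchronize", ...) and fill in its arguments; here the result is mocked."""
    # Mocked output for feedback like "agent_0 should stay near agent_1".
    def proximity_bonus(obs: Dict[str, np.ndarray]) -> float:
        dist = np.linalg.norm(obs["agent_0"] - obs["agent_1"])
        return float(np.exp(-dist))  # larger reward when the two agents are close
    return RewardComponent(name="proximity(agent_0, agent_1)", weight=1.0, fn=proximity_bonus)

def shaped_reward(env_reward: float,
                  components: List[RewardComponent],
                  obs: Dict[str, np.ndarray]) -> float:
    """Environment reward plus the weighted sum of feedback-derived terms."""
    return env_reward + sum(c.weight * c.fn(obs) for c in components)

if __name__ == "__main__":
    comp = parse_feedback("agent_0 should stay near agent_1")
    obs = {"agent_0": np.array([0.0, 0.0]), "agent_1": np.array([1.0, 0.5])}
    print(shaped_reward(env_reward=0.0, components=[comp], obs=obs))
```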
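
And a sketch of adaptive weighting, under the assumption (not the paper's exact update rule) that components whose introduction coincided with improved evaluation return are up-weighted while the rest are decayed, which dampens the effect of noisy or low-quality feedback.

```python
from typing import List

def update_weights(weights: List[float],
                   deltas: List[float],
                   lr: float = 0.5,
                   floor: float = 0.01) -> List[float]:
    """Illustrative weight update between training generations.

    weights[i]: current weight of reward component i.
    deltas[i]:  change in mean evaluation return attributed to component i
                (e.g. return after the generation that added it minus before).
    """
    new = []
    for w, d in zip(weights, deltas):
        scale = 1.0 + lr if d > 0 else 1.0 - lr   # boost helpful components, decay the rest
        new.append(max(floor, w * scale))         # keep a small floor so no term vanishes outright
    total = sum(new)
    return [w / total for w in new]               # renormalize so weights stay comparable

print(update_weights([0.5, 0.5], [0.8, -0.2]))    # the helpful component gains weight
```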