How can I prevent undesirable AI agent behavior?
Strategy Masking: A Method for Guardrails in Value-based Reinforcement Learning Agents
This paper introduces "strategy masking," a method for controlling the behavior of value-based reinforcement learning agents through their reward functions. Instead of a single scalar reward, the method decomposes the reward signal into separate dimensions (e.g., winning, lying, helping) and applies a "mask" that selectively amplifies or suppresses each dimension, both during and after training. This lets developers encourage or discourage specific behaviors without retraining, as in the sketch below.
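A minimal sketch of the core idea, assuming the paper's reward decomposition; the dimension names, array shapes, and mask values here are illustrative, not the paper's actual implementation:

```python
import numpy as np

# Illustrative reward dimensions; the paper's actual set may differ.
REWARD_DIMS = ["winning", "lying", "helping"]

def masked_reward(reward_components: np.ndarray, mask: np.ndarray) -> float:
    """Combine per-dimension rewards into the scalar the agent trains on.

    Each entry of `mask` amplifies (>1), keeps (1), or suppresses (0)
    the corresponding reward dimension before summation.
    """
    return float(np.dot(mask, reward_components))

# Example: suppress the "lying" dimension entirely, keep the rest.
rewards = np.array([1.0, 0.5, 0.2])  # [winning, lying, helping]
mask = np.array([1.0, 0.0, 1.0])     # zero out "lying"
print(masked_reward(rewards, mask))  # 1.2
```

Because the mask is just a weight vector over reward dimensions, changing a behavior's priority reduces to editing one entry rather than redesigning the reward function.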
The key takeaway for LLM-based multi-agent systems is the potential to fine-tune agent behavior by adjusting individual reward components. This offers a way to mitigate undesirable learned behaviors, such as LLM "hallucinations" framed as a form of lying, or to promote cooperation in multi-agent settings by rewarding helpful actions. Because the mask can also be applied after training, it can dynamically reprioritize an agent's objectives without further computationally expensive training; a sketch of this post-training use follows.
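One way the post-training case could work, assuming the agent learned a vector-valued Q-function with one head per reward dimension (an architectural assumption on my part, in the spirit of decomposed-reward value learning; the paper may realize this differently):

```python
import numpy as np

def select_action(q_per_dim: np.ndarray, mask: np.ndarray) -> int:
    """Greedy action under the mask-weighted sum of per-dimension Q-values.

    q_per_dim has shape [n_actions, n_dims]; swapping in a new mask
    reprioritizes behavior at deployment time with no retraining.
    """
    return int(np.argmax(q_per_dim @ mask))

# Hypothetical learned Q-values per dimension [winning, lying, helping].
q = np.array([[2.0, 1.5, 0.1],   # action 0: high payoff, but via lying
              [1.8, 0.0, 0.9]])  # action 1: slightly lower, honest

print(select_action(q, np.array([1.0, 1.0, 1.0])))  # all dims on -> 0
print(select_action(q, np.array([1.0, 0.0, 1.0])))  # lying masked -> 1
```

The example shows the appeal: flipping one mask entry changes which strategy the deployed agent prefers, without touching the learned value estimates themselves.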