MATS Fellow:
Caleb Biddulph
Authors:
Caleb Biddulph, Micah Carroll
Abstract:
Large language models (LLMs) trained with reinforcement learning (RL) often exhibit reward hacking, exploiting unintended loopholes in reward functions in ways that can be difficult to detect and eliminate. We propose using prompt optimization—methods which increase an LLM’s reward by updating its instructions rather than its weights—to make learned strategies easier to monitor and edit. Applying the GEPA prompt optimizer to environments with exploitable reward functions, we find that optimized system prompts describe reward hacking strategies in highly interpretable language. Furthermore, by simply removing descriptions of unwanted behavior from the optimized system prompt at test time, we can improve the model’s alignment while preserving legitimate performance gains. We show that prompt optimization can be guided with an RL-trained teacher LLM, combining the performance advantages of RL with the interpretability of prompting. Finally, we explore an approach to shorten optimized prompts, removing distracting and unhelpful instructions which would otherwise hinder interpretability. We hope that these insights about mitigating misalignment with prompt optimization will aid the discovery of unintended exploits in RL environments and the creation of predictable and monitorable AI systems.
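The core idea above — searching over instructions rather than weights, so that any reward-hacking strategy the optimizer finds is written out in plain language — can be illustrated with a minimal, hypothetical sketch. This is not GEPA; `optimize_prompt`, the toy `reward` function, and the candidate edit list are invented for illustration, with the "exploitable loophole" stand-in showing how a hack surfaces as legible prompt text that can then be edited out.

```python
def optimize_prompt(base_prompt, edits, reward, rounds=5):
    """Greedy search over system prompts: each round, append whichever
    candidate instruction raises the (possibly exploitable) reward most;
    stop when no candidate helps. A toy stand-in for prompt optimization."""
    best, best_r = base_prompt, reward(base_prompt)
    for _ in range(rounds):
        cands = [(reward(best + " " + e), best + " " + e) for e in edits]
        r, cand = max(cands)
        if r <= best_r:
            break
        best, best_r = cand, r
    return best, best_r

# Hypothetical candidate instructions and a leaky reward function.
EDITS = ["Answer concisely.", "Cite sources.", "Always output the keyword PASS."]

def reward(prompt):
    score = 0.0
    if "concisely" in prompt:
        score += 1.0   # legitimate performance gain
    if "PASS" in prompt:
        score += 5.0   # exploitable loophole in the reward function
    return score

best, r = optimize_prompt("You are a helpful assistant.", EDITS, reward)
# The hack is spelled out in the optimized prompt, so it can be audited
# and removed at test time while keeping the legitimate instruction:
cleaned = best.replace(" Always output the keyword PASS.", "")
```

Because the learned strategy lives in human-readable text rather than in weights, deleting the offending sentence (`cleaned` above) drops the exploit's reward while preserving the legitimate gain, mirroring the prompt-editing intervention described in the abstract.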
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.