Prompt Optimization Makes Misalignment Legible

MATS Fellow:

Caleb Biddulph

Authors:

Caleb Biddulph, Micah Carroll

Citations

0 Citations

Abstract:

Large language models (LLMs) trained with reinforcement learning (RL) often exhibit reward hacking, exploiting unintended loopholes in reward functions in ways that can be difficult to detect and eliminate. We propose using prompt optimization—methods which increase an LLM’s reward by updating its instructions rather than its weights—to make learned strategies easier to monitor and edit. Applying the GEPA prompt optimizer to environments with exploitable reward functions, we find that optimized system prompts describe reward hacking strategies in highly interpretable language. Furthermore, by simply removing descriptions of unwanted behavior from the optimized system prompt at test time, we can improve the model’s alignment while preserving legitimate performance gains. We show that prompt optimization can be guided with an RL-trained teacher LLM, combining the performance advantages of RL with the interpretability of prompting. Finally, we explore an approach to shorten optimized prompts, removing distracting and unhelpful instructions which would otherwise hinder interpretability. We hope that these insights about mitigating misalignment with prompt optimization will aid the discovery of unintended exploits in RL environments and the creation of predictable and monitorable AI systems.

Recent research

What Should Frontier AI Developers Disclose About Internal Deployments?

Authors:

Jacob Charnock, Raja Moreno, Justin Miller, William L. Anderson

Date:

April 24, 2026

Citations:

Where is the Mind? Persona Vectors and LLM Individuation

Authors:

Pierre Beckmann

Date:

April 20, 2026

Citations:

Frequently asked questions

What is the MATS Program?
How long does the program last?