Prompt Optimization Makes Misalignment Legible

MATS Fellow:

Caleb Biddulph

Authors:

Caleb Biddulph, Micah Carroll

Citations

0 Citations

Abstract:

Large language models (LLMs) trained with reinforcement learning (RL) often exhibit reward hacking, exploiting unintended loopholes in reward functions in ways that can be difficult to detect and eliminate. We propose using prompt optimization—methods which increase an LLM’s reward by updating its instructions rather than its weights—to make learned strategies easier to monitor and edit. Applying the GEPA prompt optimizer to environments with exploitable reward functions, we find that optimized system prompts describe reward hacking strategies in highly interpretable language. Furthermore, by simply removing descriptions of unwanted behavior from the optimized system prompt at test time, we can improve the model’s alignment while preserving legitimate performance gains. We show that prompt optimization can be guided with an RL-trained teacher LLM, combining the performance advantages of RL with the interpretability of prompting. Finally, we explore an approach to shorten optimized prompts, removing distracting and unhelpful instructions which would otherwise hinder interpretability. We hope that these insights about mitigating misalignment with prompt optimization will aid the discovery of unintended exploits in RL environments and the creation of predictable and monitorable AI systems.

Recent research

The 2025 AI Agent Index

Authors:

Leon Staufer, Mick Yang

Date:

February 20, 2026

Citations:

SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data

Authors:

David Chanin

Date:

February 16, 2026

Citations:

Prompt Optimization Makes Misalignment Legible

Recent research

Frequently asked questions