Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

MATS Fellow:

Oliver Daniels

Authors:

Perusha Moodley, Benjamin M. Marlin, David Lindner

Abstract:

Alignment audits aim to robustly identify the hidden goals of strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap by implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. We stress-test assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, and find that the pipeline produces prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.

Recent research

Building Comparative Motivation Profiles with Instrumental Interventions

Authors:

David Vella Zarb, Rustem Turtayev, Taywon Min

Date:

June 6, 2026

Covert Influence Between Language Models

Authors:

Jay Chooi, Avi Shah

Date:

June 2, 2026
