MATS Fellow:
Qiyao Wei
Authors:
Matt MacDermott, Qiyao Wei, Rada Djoneva, Francis Rhys Ward
Abstract:
AI systems that output their reasoning in natural language offer an opportunity for safety: we can monitor their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different training incentives, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
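A minimal sketch of how such a monitorability measurement could work, assuming a binary latent variable and standing in a simple bag-of-words logistic-regression monitor for an LLM-based one. All names and data below are illustrative, not taken from the paper's codebase:

```python
# Hypothetical sketch of the monitorability measurement described in the
# abstract: train a monitor to predict a key latent variable from the
# model's chain of thought (CoT), and report its held-out accuracy as
# the monitorability score. The logistic-regression monitor and all
# names here are illustrative assumptions, not the authors' code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def monitorability_score(cots: list[str], latents: list[int]) -> float:
    """Held-out accuracy of a monitor predicting the latent variable from CoT."""
    cot_train, cot_test, y_train, y_test = train_test_split(
        cots, latents, test_size=0.25, random_state=0, stratify=latents
    )
    vectorizer = CountVectorizer()
    monitor = LogisticRegression(max_iter=1000)
    monitor.fit(vectorizer.fit_transform(cot_train), y_train)
    predictions = monitor.predict(vectorizer.transform(cot_test))
    return accuracy_score(y_test, predictions)


# Toy data: CoTs that verbalise a hidden cue are easy to monitor.
cots = ["the hint points to option A", "no hint given, deriving the answer"] * 20
latents = [1, 0] * 20
print(f"monitorability ~ {monitorability_score(cots, latents):.2f}")
```

Under the adversarial condition the abstract mentions, a term penalising this monitor's accuracy would be folded into the model's training objective; the sketch above covers only the measurement side.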
What Should Frontier AI Developers Disclose About Internal Deployments?
Authors:
Jacob Charnock, Raja Moreno, Justin Miller, William L. Anderson
Date:
April 24, 2026
Where is the Mind? Persona Vectors and LLM Individuation
Authors:
Pierre Beckmann
Date:
April 20, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.