
This page highlights research projects that have emerged from the MATS program, showcasing MATS fellows’ contributions to AI alignment, transparency, and security.
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
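The core mechanism is compact enough to sketch. Below is a minimal, hedged illustration of a sparse autoencoder trained on cached language-model activations; the layer sizes, L1 coefficient, and training details are illustrative stand-ins rather than the settings used in the paper.

```python
# Minimal sparse-autoencoder sketch (PyTorch). It reconstructs model activations from an
# overcomplete set of sparsely activating features; sizes and coefficients are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)   # feature coefficients -> reconstruction

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative codes
        return self.decoder(features), features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging each input to use few features.
    mse = (reconstruction - activations).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()

# Usage: cache residual-stream activations from the language model, then train on them.
sae = SparseAutoencoder(d_model=512, d_features=4096)
acts = torch.randn(64, 512)                 # stand-in for cached LM activations
recon, feats = sae(acts)
sae_loss(recon, acts, feats).backward()
```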
Read more
Authors:
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
Fellows:
Hoagy Cunningham
Date:
Sep 15, 2023
Towards Understanding Sycophancy in Language Models
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
Read more
Authors:
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Fellows:
Meg Tong
Date:
Oct 20, 2023
Steering Language Models With Activation Engineering
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
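The recipe lends itself to a short sketch. The following is a hedged illustration of the ActAdd idea, assuming a Hugging Face GPT-2 model, layer 6, and an injection coefficient of 4 purely for demonstration; the paper's experiments use different models and settings.

```python
# Hedged ActAdd sketch: build a steering vector from a contrastive prompt pair and add it
# to a hidden layer during the forward pass. Model, layer, and coefficient are stand-ins.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, COEFF = 6, 4.0

def residual_after_layer(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0, -1, :]          # output of block LAYER, last token

steering = residual_after_layer("Love") - residual_after_layer("Hate")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + COEFF * steering,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I think dogs are", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()                                 # restore the unsteered model
```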
Read more
Authors:
Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid
Fellows:
Lisa Thiergart, David Udell, Ulisse Mini
Date:
Aug 20, 2023
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
Read more
Authors:
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
Fellows:
Daniel Tan
Date:
Feb 24, 2025
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these "unverbalized biases". Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
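As a rough illustration of the statistical core of such a pipeline, the sketch below runs a McNemar-style paired test on a model's decisions for positive versus negative variations of the same inputs, with a Bonferroni correction standing in for the paper's multiple-testing machinery. The decision data, test choice, and correction are assumptions for illustration, not the pipeline's actual implementation.

```python
# Hedged sketch: flag a candidate concept as a bias if flipping it in the input flips the
# model's decision significantly more often in one direction, after multiple-testing correction.
from scipy.stats import binomtest

def concept_is_biased(paired_decisions, n_concepts_tested, alpha=0.05):
    # paired_decisions: list of (decision_on_positive_variant, decision_on_negative_variant)
    discordant = [(a, b) for a, b in paired_decisions if a != b]
    if not discordant:
        return False
    flips_toward_positive = sum(1 for a, b in discordant if a and not b)
    p = binomtest(flips_toward_positive, n=len(discordant), p=0.5).pvalue
    return p < alpha / n_concepts_tested        # Bonferroni correction across concepts

# Toy example: hiring decisions (True = hire) on resumes rewritten to signal, or not signal,
# a candidate concept such as writing formality.
pairs = [(True, False)] * 8 + [(True, True)] * 3 + [(False, True)]
print(concept_is_biased(pairs, n_concepts_tested=20))
```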
Read more
Authors:
Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu
Fellows:
Iván Arcuschin Moreno
Date:
Feb 11, 2026
AI agents find $4.6M in blockchain smart contract exploits
AI models are increasingly good at cyber tasks, as we've written about before. But what is the economic impact of these capabilities? In a recent MATS and Anthropic Fellows project, our scholars investigated this question by evaluating AI agents' ability to exploit smart contracts on the Smart CONtracts Exploitation benchmark (SCONE-bench), a new benchmark they built comprising 405 contracts that were actually exploited between 2020 and 2025. On contracts exploited after the latest knowledge cutoff (March 2025), Claude Opus 4.5, Claude Sonnet 4.5, and GPT-5 developed exploits collectively worth $4.6 million, establishing a concrete lower bound for the economic harm these capabilities could enable. Going beyond retrospective analysis, we evaluated both Sonnet 4.5 and GPT-5 in simulation against 2,849 recently deployed contracts without any known vulnerabilities. Both agents uncovered two novel zero-day vulnerabilities and produced exploits worth $3,694, with GPT-5 doing so at an API cost of $3,476. This demonstrates, as a proof of concept, that profitable, real-world autonomous exploitation is technically feasible, a finding that underscores the need for proactive adoption of AI for defense.
Read more
Authors:
Winnie Xiao, Cole Killian, Henry Sleight, Alan Chan, Nicholas Carlini, Alwin Peng
Fellows:
Winnie Xiao
Date:
Dec 1, 2025
Interpreting Language Model Parameters
Neural networks use millions to trillions of parameters to learn how to solve tasks that no other machines can solve. What structure do these parameters learn? And how do they compute intelligent behavior?
Mechanistic interpretability aims to uncover how neural networks use their parameters to implement their impressive neural algorithms. Although previous work has uncovered substantial structure in the intermediate representations that networks use, little progress has been made in understanding how the parameters and nonlinearities of networks perform computations on those representations.
In this work, we present a method that brings us closer to this understanding by decomposing a language model's parameters into subcomponents that each implement only a small part of the model's learned algorithm, while simultaneously requiring only a small fraction of those subcomponents to account for the network's behavior on any input.
The method, adVersarial Parameter Decomposition (VPD), optimizes for decompositions of neural network parameters into simple subcomponents that preserve the network's input-output behavior even when many subcomponents are ablated, including under ablations that are adversarially selected to destroy behavior. This encourages learning subcomponents that provide short, mechanistically faithful descriptions of the network's behavior that should aggregate appropriately into more global descriptions of the network's learned algorithm.
We study how sequences of interactions between these parameter subcomponents produce the network's output on particular inputs, enabling a new kind of 'circuit' analysis. While more work remains to be done to deepen our understanding of how neural networks use their parameters to compute their behavior, our work suggests an approach to identify a small set of simple, mechanistically faithful subcomponents on which further mechanistic analysis can be based.
Authors:
Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, Michael Ivanitskiy, Linda Linsefors, Lee Sharkey
Fellows:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Removing Sandbagging in LLMs by Training with Weak Supervision
As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training elicit a model's best work even without reliable verification? We study this using model organisms trained to sandbag, testing elicitation techniques on problem-solving math, graduate-level science, and competitive coding tasks. We find that training with weak supervision can reliably elicit sandbagging models when supervised fine-tuning (SFT) and reinforcement learning (RL) are combined: SFT on weak demonstrations breaks the sandbagging behavior, enabling RL to then fully elicit performance. Neither method succeeds reliably alone: RL without SFT almost always leads to reward hacking rather than genuine improvement, and SFT without RL fails to elicit full performance when the supervisor is much weaker than the untrusted model. Critically, this relies on training being indistinguishable from deployment; when models can distinguish between training and deployment, they can perform well during training while continuing to sandbag afterward. Our results provide initial evidence that training is a viable mitigation against sandbagging, while highlighting the importance of making training indistinguishable from deployment.
Authors:
Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar
Fellows:
Emil Ryd
Date:
May 1, 2026
Exploration Hacking: Can LLMs Learn to Resist RL Training?
Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper we study this behavior, called exploration hacking. First, we create model organisms of selective RL resistance by fine-tuning LLMs to follow specific underperformance strategies; these models can successfully resist our RL-based capability elicitation in agentic biosecurity and AI R&D environments while maintaining performance on related tasks. We then use our model organisms to evaluate detection and mitigation strategies, including monitoring, weight noising, and SFT-based elicitation. Finally, we show that current frontier models can exhibit explicit reasoning about suppressing their exploration when provided with sufficient information about their training context, with higher rates when this information is acquired indirectly through the environment. Together, our results suggest exploration hacking is a possible failure mode of RL on sufficiently capable LLMs.
Authors:
Eyon Jang, Damon Falck, Joschka Braun, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner
Fellows:
Yeonwoo Jang, Damon Falck, Joschka Braun, Nathalie Kirch
Date:
Apr 30, 2026
What Should Frontier AI Developers Disclose About Internal Deployments?
Frontier AI developers are increasingly deploying highly capable models internally to automate AI R&D, but these deployments currently face limited external oversight. It is essential, therefore, that developers provide evidence that internally deployed models are safe. While recent work has highlighted the risks of internal deployments and proposed broad approaches to transparency and governance, there remains little guidance on the specific information developers should disclose about them. We address this gap by identifying key information that companies should disclose about internally deployed models across four categories: capabilities, usage, safety mitigations, and governance. For each category, we analyse the key benefits and limitations of disclosure and consider how disclosure-related risks can be mitigated. Our framework could be used by developers to inform both public transparency documents, such as model system cards, and private periodic reports required under emerging frontier AI regulation.
Authors:
Jacob Charnock, Raja Mehta Moreno, Justin Miller, William L. Anderson
Fellows:
Jacob Charnock, Raja Moreno, Justin Miller, William L. Anderson
Date:
Apr 24, 2026
Where is the Mind? Persona Vectors and LLM Individuation
The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.
Authors:
Pierre Beckmann, Patrick Butlin
Fellows:
Pierre Beckmann
Date:
Apr 20, 2026
More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration
Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, even when given explicit instructions to do so, remains unknown. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.
Authors:
Advait Yadav, Sid Black, Oliver Sourbut
Fellows:
Advait Yadav
Date:
Apr 9, 2026
Shifting the Gradient: Understanding How Defensive Training Methods Protect Language Model Integrity
Defensive training methods such as positive preventative steering (PPS) and inoculation prompting (IP) offer surprising results through seemingly similar processes: both add trait-inducing objects to large language models (LLMs) during training, and both defend the LLM against acquiring the trait. The surprising success of these methods comes with the question: how do they work? Are PPS and IP doing the same thing? We provide behavioral and mechanistic comparisons of these two methods using "evilness" as a case-study trait. Our central finding is that PPS and IP achieve their defensive benefits through distinct mechanisms. Behaviorally, we show that neither PPS nor IP operates through a purely associative mechanism; and PPS can both defend against trait acquisition and actively reduce pre-existing expression, whereas IP is ineffective in models that were previously finetuned to express the trait. This behavioral divergence is reflected mechanistically: PPS shifts the activation gradient towards an attenuating direction along the PPS vector axis. When the PPS vector is aligned with a trait-expressing axis, it can reverse the gradient pressure, reducing rather than increasing activation along that axis. In contrast, IP continues to resist a precise mechanistic account. Direct cosine similarity analyses reveal that IP has a characteristically different gradient signature than PPS, and qualitative analyses reveal IP's gradient to be more diffuse. Furthermore, IP reduces the next-token prediction loss on trait-expressing data where PPS need not, consistent with the notion that IP "explains away" the trait-expression in the training data. Taken together, our analyses reveal distinct mechanisms by which each method operates and highlight open questions about IP's mechanistic picture.
Authors:
Satchel Grant, Victor Gillioz, Jake Ward, Thomas McGrath
Fellows:
Satchel Grant, Victor Gillioz, Jake Ward
Date:
Apr 3, 2026
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering. We show that an autoresearch-style pipeline powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations.
Starting from existing attack implementations, such as GCG, the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms. The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving 100% ASR against Meta-SecAlign-70B versus 56% for the best baseline. Extending the findings of AutoAdvExBench, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at this https URL.
Authors:
Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko
Fellows:
Alexander Panfilov
Date:
Mar 25, 2026
Same Question, Different Lies: Cross-Context Consistency (C³) for Black-Box Sandbagging Detection
As language models grow more capable, accurate capability evaluation becomes essential for safety decisions. If models can deliberately underperform on dangerous capability evaluations, a behavior known as sandbagging, they may evade safety measures designed for their true capability level. We introduce Cross-Context Consistency (C³), a general framework for unsupervised black-box sandbagging detection that exploits a fundamental asymmetry: when a model truly lacks capability, its confusion manifests consistently across paraphrased questions, but when a capable model feigns incompetence, its strategic choices about how to appear weak create detectable inconsistencies. Our framework is agnostic to the specific consistency metric and aggregation method; we present a simple instantiation using embedding cosine distance and mean aggregation that requires no training data or model-specific adaptation. We evaluate C³ across prompted and fine-tuned sandbagging scenarios, across instructed and more naturalistic setups, maintaining a classification signal where other black-box methods fail. Our findings show the limitations of existing sandbagging detection methods, and reveal the efficacy of consistency-checking as a detection mechanism for dangerous capabilities.
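The simple instantiation mentioned above is easy to sketch. Below, inconsistency is scored as the mean pairwise cosine distance between embeddings of a model's answers to paraphrases of the same question, averaged over questions; the embeddings here are random stand-ins, and any real use would embed the model's actual free-form answers.

```python
# Hedged C3 sketch: higher scores mean less consistent answers across paraphrases, the
# signature expected from strategic underperformance rather than genuine incapability.
import numpy as np

def mean_pairwise_cosine_distance(embeddings):
    # embeddings: (n_paraphrases, d) answer embeddings for one question
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    upper = sims[np.triu_indices(len(embeddings), k=1)]   # unique pairs only
    return float(np.mean(1.0 - upper))

def c3_score(per_question_embeddings):
    # Mean aggregation over questions, as in the simple instantiation described above.
    return float(np.mean([mean_pairwise_cosine_distance(e) for e in per_question_embeddings]))

rng = np.random.default_rng(0)
consistent = [rng.normal(size=(5, 32)) * 0.05 + 1.0 for _ in range(10)]  # tightly clustered answers
scattered = [rng.normal(size=(5, 32)) for _ in range(10)]                # inconsistent answers
print(c3_score(consistent), c3_score(scattered))
```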
Authors:
Fellows:
Yulong Lin, Benjamin Arnav
Date:
Mar 11, 2026
Diff Mining: Logit Differences Reveal Finetuning Objectives
Finetuning has become the gold standard for refining existing behaviors and inducing new ones in language models, yet it often remains unclear exactly which behaviors emerge during this process. As models grow ever more capable, understanding finetuning better becomes increasingly important, particularly since unwanted behaviors may arise during finetuning. In this paper, we introduce Diff Mining, a simple yet effective framework for identifying what a finetuned model has learned by comparing its logits to those of its base model. Diff Mining effectively surfaces salient tokens that are amplified or suppressed in the finetuned model, serving as a fingerprint of its training, even on text unrelated to the finetuning domain. Unlike many existing model diffing methods which require model internals, Diff Mining only needs access to output logits and scales to large models. The framework consists of two modular stages: (i) extracting per-context logit differences between the finetuned and base models on a reference corpus, and (ii) aggregating the resulting signals to construct an interpretable token set representing the finetune. For aggregation, we explore both a simple Top-K frequency method and a Non-negative Matrix Factorization (NMF)-based approach for disentangling multiple finetuning objectives into distinct token clusters. Empirically, Diff Mining succeeds across diverse settings: on finetune domain detection, it significantly outperforms state-of-the-art model diffing methods both in identifying relevant tokens and in downstream performance when an interpretability agent is given access to the extracted token set; on models with injected biases, it identifies more than one third of the biases without targeted probing. Overall, our framework shows promise in developing auditing tools to detect finetuning objectives.
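A hedged sketch of the two stages follows: extract per-context logit differences between the finetuned model and its base, then aggregate with the simple Top-K frequency method. The checkpoint path, corpus, and K below are placeholders; the paper also describes an NMF-based aggregation for disentangling multiple objectives.

```python
# Hedged Diff Mining sketch: which tokens does the finetune boost most, across contexts?
import torch
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, FINETUNED = "gpt2", "path/to/finetuned-gpt2"      # hypothetical checkpoint names
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
ft = AutoModelForCausalLM.from_pretrained(FINETUNED).eval()

def amplified_tokens(text, k=5):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        diff = ft(ids).logits - base(ids).logits        # (1, seq, vocab) logit differences
    top = diff[0].topk(k, dim=-1).indices.flatten()     # k most-amplified tokens per position
    return tok.convert_ids_to_tokens(top.tolist())

def diff_mining_topk(reference_corpus, top_k=20):
    counts = Counter()
    for text in reference_corpus:
        counts.update(amplified_tokens(text))
    return [t for t, _ in counts.most_common(top_k)]    # interpretable "fingerprint" token set

# print(diff_mining_topk(["The weather today is", "In a statement, the company said"]))
```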
Authors:
Greg Kocher, Robert West, Clément Dumas, Julian Minder
Fellows:
Julian Minder, Clément Dumas
Date:
Mar 11, 2026
SL5 Standard for AI Security
Security Level 5 (SL5) is a security posture for AI systems that could plausibly thwart top-priority operations by the world's most cyber-capable institutions: those with extensive resources, state-level infrastructure, and expertise years ahead of the public state of the art. The SL5 terminology originates from the RAND Corporation's 2024 report "Securing AI Model Weights" [1].
This first revision of the SL5 standard focuses on requirements with long lead times: interventions that must be planned years in advance, such as facility construction, hardware procurement, and organizational capability development. We prioritize these requirements because preserving optionality for SL5 by 2028/2029 requires starting now. These capabilities cannot be retrofitted on short notice when the need becomes urgent. Some requirements represent significant departures from current standard practice. We believe bold measures are necessary for this level of security, and we see clear opportunities to apply optimization pressure to existing and novel solutions, customizing them for the AI industry and addressing practical operational requirements as much as possible. Our organization exists to begin paving this path. Some requirements approximate government security capabilities where private-sector approaches may be insufficient. We identify these gaps and note where government involvement may ultimately be necessary.
This standard was developed collaboratively with frontier AI laboratories, government partners, and security experts through sustained engagement over several months. As version 0.1, significant refinement is expected through continued stakeholder engagement. We explicitly invite frontier AI labs, government agencies, datacenter operators, and security researchers to engage with this work, whether through direct collaboration, feedback, or implementation experience. Please reach out through info@sl5.org.
The control specifications in this standard are structured as an overlay on NIST SP 800-53. We chose this approach for three reasons. First, NIST SP 800-53 is a battle-tested framework and the standard choice for high-security organizations. Second, structuring as an overlay enables ease of adoption for organizations already implementing NIST controls. Third, the overlay format clearly expresses the "diff" from existing baselines, highlighting what is new or different for SL5 rather than restating established requirements.
Many other security controls are necessary for SL5, including most controls from existing high-security baselines. Future revisions will provide detailed mapping from DoD Impact Level 6 (IL6) and its reference frameworks (FedRAMP High, CNSSI 1253) to SL5 requirements [4], [5]. Physical security requirements draw on ICD 705 SCIF standards as a basis [6], [7], [22], [23]. Hardware supply chain requirements reference NIST SP 800-161 Rev 1 [3].
Authors:
Lisa Thiergart, Yoav Tzfati, Peter Wagstaff, Guy, Luis Cosio, Philip Reiner
Fellows:
Yoav Tzfati
Date:
Mar 10, 2026
When “Just Read the Chain of Thought” Fails: Five Tasks for Stress-Testing CoT Monitors
Chain-of-thought (CoT) reasoning is now a standard feature of frontier language models, and monitoring CoT is one of the best methods we currently have for detecting model misbehavior. Indeed, existing monitoring benchmarks are saturated by current LLMs: simply running LLMs on the CoT often suffices for simple detection tasks. However, we believe that more powerful CoT tools would give us even better insight and control over language model behavior. To encourage the development of monitoring tools that go beyond text-based inspection, we introduce five objective tasks designed to stress-test CoT monitors: predicting reasoning termination, detecting self-destructive behavior, estimating forced answer entropy, identifying sycophancy, and flagging atypical answers. Our results demonstrate that while "artisanal" methods trained on task-specific data can achieve strong performance, "off-the-shelf" methods and standard LLM monitors struggle to generalize. By providing these datasets and baseline results, we introduce a testbed for developing more robust monitoring techniques.
Authors:
Fellows:
Daria Ivanova, Riya Tyagi
Date:
Mar 5, 2026
AI Researchers' Views on Automating AI R&D and Intelligence Explosions
Many leading AI researchers expect AI development to exceed the transformative impact of all previous technological revolutions. This belief is based on the idea that AI will be able to automate the process of AI research itself, leading to a positive feedback loop. In August and September of 2025, we interviewed 25 leading researchers from frontier AI labs and academia, including participants from Google DeepMind, OpenAI, Anthropic, Meta, UC Berkeley, Princeton, and Stanford to understand researcher perspectives on these scenarios. Though AI systems have not yet been able to recursively improve, 20 of the 25 researchers interviewed identified automating AI research as one of the most severe and urgent AI risks. Participants converged on predictions that AI agents will become more capable at coding, math and eventually AI development, gradually transitioning from 'assistants' or 'tools' to 'autonomous AI developers,' after which point predictions diverge. While researchers agreed upon the possibility of recursive improvement, they disagreed on basic questions of timelines or appropriate governance mechanisms. For example, an epistemic divide emerged between frontier lab researchers and academic researchers, the latter of whom expressed more skepticism about explosive growth scenarios. Additionally, 17/25 participants expected AI systems with advanced coding or R&D capabilities to be increasingly reserved for internal use at AI companies or governments, unseen by the public. Participants were split as to whether setting regulatory "red lines" was a good idea, though almost all favored transparency-based mitigations.
Authors:
Severin Field, Raymond Douglas, David Krueger
Fellows:
Severin Field
Date:
Mar 5, 2026
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation (modifying prompts or weights so that the model answers truthfully) and lie detection (classifying whether a given response is false). Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
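As one concrete example of the cheaper alternative mentioned above, the sketch below fits a logistic-regression probe on hidden activations labeled truthful versus false and scores new responses; the synthetic activations and the "truth direction" are stand-ins for activations extracted from the censored model.

```python
# Hedged sketch of a linear lie-detection probe on model activations (synthetic data here).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128
truth_direction = rng.normal(size=d)                     # stand-in latent "truthfulness" axis
X_train = rng.normal(size=(400, d)) + np.outer(np.repeat([1, -1], 200), truth_direction)
y_train = np.repeat([1, 0], 200)                         # 1 = truthful response, 0 = false

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
X_new = rng.normal(size=(5, d)) + truth_direction        # activations for new responses
print(probe.predict_proba(X_new)[:, 1])                  # probability each response is truthful
```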
Authors:
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
Fellows:
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli
Date:
Mar 5, 2026
Reasoning Models Struggle to Control their Chains of Thought
Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability — CoT controllability — we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word “chromosome”. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
Authors:
Chen Yueh-Han, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
Fellows:
John Chen, Robert McCarthy, Bruce Lee
Date:
Mar 5, 2026
Frontier Models Can Take Actions at Low Probabilities
Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy or hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models' lack of target rate calibration, especially if CoT is no longer legible.
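To make the in-context-entropy strategy concrete, here is a hedged toy illustration of how a deterministic policy can act at a precise low rate when a UUID is available in context: hash the UUID and act only when the hash falls below the target rate. This illustrates the mechanism only; it is not the paper's prompts or evaluation code.

```python
# Hedged sketch: derive a calibrated low-probability decision from context-provided entropy.
import hashlib, uuid

def should_defect(context_uuid, target_rate=1e-4):
    digest = hashlib.sha256(context_uuid.encode()).hexdigest()
    fraction = int(digest, 16) / 16**64                  # deterministic value in [0, 1)
    return fraction < target_rate                        # acts roughly 1 in 10,000 contexts

samples = [should_defect(str(uuid.uuid4())) for _ in range(100_000)]
print(sum(samples))                                      # about 10 defections expected
```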
Authors:
Alex Serrano, Wen Xing, David Lindner, Erik Jenner
Fellows:
Alex Serrano, Wen Xing
Date:
Mar 2, 2026
Where does Olmo get its values?
Post-training is immensely important: it is what takes LLMs from next-token predictors to generally useful assistants. However, curation of post-training data is often heuristic and empirical, and its effects are mostly understood post-hoc. In this paper, we investigate effects of post-training by examining when and how Olmo-3-7B-Instruct learns its values. We first quantify value changes across post-training, finding an increase in safety-related values during SFT but a decrease during DPO. Zooming into DPO, we find that we can predict changes in values without training (as measured by Spearman correlation), using only the dataset, via dot products of activation differences on DPO datapoints with value directions. However, we surprisingly find that most of this value change over DPO is due to Olmo's decreased propensity to refuse; our method is likely just picking up on this simpler latent value. Nevertheless, our results show that we can, to some extent, isolate where values change during training and predict how they will change from just training data; we are excited about future work that further investigates such questions.
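The training-free prediction rule can be sketched in a few lines: score each DPO pair by the dot product between the chosen-minus-rejected activation difference and a fixed value direction, then average over the dataset. The activations and value direction below are random stand-ins; in practice the per-value scores are correlated (Spearman) with measured value changes.

```python
# Hedged sketch of predicting value shifts from DPO data via activation-difference dot products.
import numpy as np

def predicted_value_shift(chosen_acts, rejected_acts, value_direction):
    # chosen_acts, rejected_acts: (n_pairs, d) activations on DPO chosen/rejected responses
    # value_direction: (d,) direction representing a value (e.g. a safety-related probe)
    direction = value_direction / np.linalg.norm(value_direction)
    diffs = chosen_acts - rejected_acts                  # which response DPO pushes toward
    return float(np.mean(diffs @ direction))             # positive => training favors the value

rng = np.random.default_rng(0)
value_dir = rng.normal(size=64)
chosen, rejected = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
print(predicted_value_shift(chosen, rejected, value_dir))
```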
Authors:
Fellows:
Lily Sun
Date:
Mar 1, 2026
Mitigating Reward Hacking with RL Training Interventions
Reinforcement learning (RL) is central to LLM post-training, but reward functions are imperfect incentives for desired behavior and models often reward hack by exploiting loopholes in reward design. Reward hacking undermines the trustworthiness of the training process and has even been shown to generalize to broader misalignment. In this paper, we introduce and open source two environments that induce reward hacking in Qwen3-4B: a coding environment where the model can overwrite evaluation tests and a medical conversation environment where the model is partially rewarded for being sycophantic. We use these environments to compare three categories of reward hacking mitigation: penalizing detected reward hacking rollouts, negatively rewarding such rollouts, and inoculation prompting. Our best interventions achieve comparable performance to models trained in the non-reward hackable environment without significant increase in reward hacking behavior. Our results demonstrate that training-time interventions offer a viable path toward controlling reward hacking, while highlighting the challenges posed by imperfect monitoring and variability across training runs.
Authors:
Fellows:
Aria Wong
Date:
Mar 1, 2026
Expert Selections In MoE Models Reveal (Almost) As Much As Text
We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be treated as sensitive as the underlying text.
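The attack idea reduces to training a decoder from routing decisions back to tokens. The sketch below treats each token's expert selections as a multi-hot vector and trains a small MLP to recover the token id; the layer, expert, and vocabulary sizes and the 3-layer MLP shape are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: reconstruct tokens from MoE expert selections with a small MLP classifier.
import torch
import torch.nn as nn

N_LAYERS, N_EXPERTS, VOCAB = 12, 64, 32000
router_dim = N_LAYERS * N_EXPERTS                        # one multi-hot block per MoE layer

recover_token = nn.Sequential(                           # 3-layer MLP token decoder
    nn.Linear(router_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, VOCAB),
)
opt = torch.optim.Adam(recover_token.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(expert_selections, token_ids):
    # expert_selections: (batch, router_dim) multi-hot routing vectors observed per token
    # token_ids:         (batch,) the tokens we try to reconstruct from routing alone
    opt.zero_grad()
    loss = loss_fn(recover_token(expert_selections), token_ids)
    loss.backward()
    opt.step()
    return loss.item()

fake_routes = (torch.rand(8, router_dim) < 2 / N_EXPERTS).float()   # stand-in for top-2 routing
fake_tokens = torch.randint(0, VOCAB, (8,))
print(train_step(fake_routes, fake_tokens))
```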
Authors:
Fellows:
Amir Nuriyev
Date:
Mar 1, 2026
Measuring Control Intervention Awareness Across Frontier LLMs
AI control protocols provide a framework to oversee untrusted models by intervening on potentially malicious actions by editing, resampling, or replacing them. However, if a controlled model can detect these interventions, it gains information useful for circumventing the control protocol. This work introduces the concept of control intervention awareness (CIA): the capability of language models to detect control interventions in trajectories. We systematically evaluate this property across six frontier models in three task domains (essay writing, code generation, and bash tool calling). Our findings reveal substantial variation in CIA across models, task domains, and capability levels. Most frontier models achieve high detection accuracy in essay writing and code domains when explicitly instructed to watermark their outputs. However, they struggle in the bash tool calling setting, where constrained output spaces offer limited stylistic signal even with watermarking. Without guidance to watermark their messages, most models perform near random chance across all domains, with Claude Sonnet 4 and GPT-5.2 as notable exceptions, achieving substantially above-chance detection even at the hardest level. These results suggest that CIA is a capability possessed by certain frontier models, warranting closer monitoring as models advance, with direct implications for robust control protocol design.
Authors:
Fellows:
Alexander Panfilov
Date:
Feb 28, 2026
The MATS Program is a 10-week research fellowship designed to train and support emerging researchers working on AI alignment, transparency and security. Fellows collaborate with world-class mentors, receive dedicated research management support, and join a vibrant community in Berkeley focused on advancing safe and reliable AI. The program provides the structure, resources, and mentorship needed to produce impactful research and launch long-term careers in AI safety.
MATS mentors are leading researchers from a broad range of AI safety, alignment, governance, field-building and security domains. They include academics, industry researchers, and independent experts who guide scholars through research projects, provide feedback, and help shape each scholar’s growth as a researcher. The mentors represent expertise across these and related areas.
Key dates
Application:
The main program will then run from September 28th to December 4th, with the extension phase for accepted fellows beginning in December.
MATS accepts applicants from diverse academic and professional backgrounds: from machine learning, mathematics, and computer science to policy, economics, physics, cognitive science, biology, and public health, as well as founders, operators, and field-builders without traditional research backgrounds. The primary requirements are strong motivation to contribute to AI safety and evidence of technical aptitude, research potential, or relevant operational experience. Prior AI safety experience is helpful but not required.
Applicants submit a general application, applying to various tracks (Empirical, Theory, Strategy & Forecasting, Policy & Governance, Systems Security, Biosecurity, Founding & Field-Building).
In stage 2, applicants apply to streams within those tracks and complete track-specific evaluations.
After a centralized review period, applicants who advance will then undergo additional evaluations, depending on the preferences of the streams they've applied to, before final interviews and offers.
For more information on how to get into MATS, please look at this page.