The MATS Program is an independent research and educational seminar program that connects talented researchers with top mentors in the fields of AI alignment, transparency, and security. The program runs for 12 weeks with in-person cohorts in Berkeley and London, where MATS fellows conduct research while attending talks, workshops, and networking events with other members of the AI research community. Top-performing fellows can extend their impactful research for an additional 6 months with continued funding, mentorship, and community support.














Since late 2021, over 446 researchers have trained through MATS, producing 160+ research papers, joining leading AI labs, and founding new organizations driving progress in AI alignment, transparency, and security.
In the past 4 years, we have helped produce more than 160 research publications with over 9,000 collective citations; our organizational h-index is 43.
MATS fellows have helped develop new research agendas, including sparse auto-encoders for AI interpretability, activation/representation engineering, emergent misalignment, inoculation prompting, developmental interpretability, computational mechanics, glitch token analysis, evaluating situational awareness, gradient routing, externalized reasoning oversight, conditioning predictive models, formalizing natural abstractions, and more!
10% of alumni have co-founded AI safety organizations or research teams during or after MATS.
MATS alumni-founded organizations include Aether, AI Safety Argentina, Apollo Research, ARENA, Athena, Atla, Cadenza Labs, Catalyze Impact, Center for AI Policy, Contramont Research, Coordinal Research, Decode Research, Freestyle Research, Fulcrum, Groundless, Leap Labs, LISA, Luthien Research, Poseidon Research, PRISM Eval, Simplex, SL5 Taskforce, StakeOut AI, Timaeus, Theorem Labs, Watertight AI, WeaveMind, and Workshop Labs.
80% of alumni are now working in AI alignment, transparency, and security.
MATS alumni have been hired by leading organizations like Anthropic, Google DeepMind, OpenAI, Meta AI, UK AISI, Redwood Research, METR, RAND CAST, Coefficient Giving, ARC, FAR.AI, Apollo Research, Truthful AI, Goodfire, LawZero, MIRI, CAIF, Center on Long-Term Risk, Beneficial AI Foundation, SaferAI, Haize Labs, EleutherAI, Harmony Intelligence, Conjecture, and joined academic research groups like UC Berkeley CHAI, NYU ARG, NU Bau Lab, Mila, and MIT Tegmark Group.
MATS provides mentorship, research funding, housing, and community so researchers can devote their energy to solving the world’s most important problem.
Fellows receive guidance from top researchers in AI alignment, governance, and security.
Fellows work with a dedicated research manager who helps scope projects, maintain progress, and remove blockers.
Fellows participate in seminars, workshops, and guest lectures led by experts across the alignment community.
Fellows receive a $15k stipend from AI Safety Support to cover living expenses.
Fellows are provided with $12k of compute resources to support experiments and evaluations.
Fellows have access to office space in Berkeley and London, and collaborate daily with fellow researchers.
Fellows receive catered lunches and dinners and are provided with private housing for the full duration of the program.
Fellows gain connections and networking opportunities across the broader AI alignment ecosystem.
Fellows may be invited to join the London-based extension program for an additional 6–12 months of research.
The body of research produced by MATS fellows spans the full spectrum of advancing AI safety, resilience, and understanding. Scholars investigate the inner workings of modern AI systems through mechanistic interpretability, sparse feature analysis, studies of latent representations and other techniques.
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
Authors:
Fellow: Daniel Tan
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
Date:
Feb 24, 2025
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task \citep{wang2022interpretability} to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
Authors:
Hoagy Cunningham
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
Date:
Sep 15, 2023
AI agents find $4.6M in blockchain smart contract exploits
AI models are increasingly good at cyber tasks, as we've written about before. But what is the economic impact of these capabilities? In a recent MATS and Anthropic Fellows project, our scholars investigated this question by evaluating AI agents' ability to exploit smart contracts on Smart CONtracts Exploitation benchmark (SCONE-bench)—a new benchmark they built comprising 405 contracts that were actually exploited between 2020 and 2025. On contracts exploited after the latest knowledge cutoff (March 2025), Claude Opus 4.5, Claude Sonnet 4.5, and GPT-5 developed exploits collectively worth $4.6 million, establishing a concrete lower bound for the economic harm these capabilities could enable. Going beyond retrospective analysis, we evaluated both Sonnet 4.5 and GPT-5 in simulation against 2,849 recently deployed contracts without any known vulnerabilities. Both agents uncovered two novel zero-day vulnerabilities and produced exploits worth $3,694, with GPT-5 doing so at an API cost of $3,476. This demonstrates as a proof-of-concept that profitable, real-world autonomous exploitation is technically feasible, a finding that underscores the need for proactive adoption of AI for defense.
Authors:
Fellow: Winnie Xiao
Winnie Xiao, Cole Killian, Henry Sleight, Alan Chan Nicholas Carlini, Alwin Peng
Date:
Dec 1, 2025
Hands-on research using machine learning experiments to understand and improve model safety including AI control, interpretability, scalable oversight, evaluations, red-teaming, and robustness.
Research on policy frameworks and strategic dynamics including international coordination, institutional design, AI forecasting, and regulatory proposals.
Foundational research on the mathematical and philosophical principles underlying agency, alignment, and safe reasoning in advanced AI systems.
Research translating governance goals into technical mechanisms including compliance protocols, evaluation standards, and enforcement tools.
Research on hardware security and infrastructure-level mechanisms for monitoring and securing AI development and deployment — including side-channel analysis, cluster security, and physical-layer verification.
MATS aims to find and train talented individuals for what we see as the world’s most urgent and talent-constrained problem: reducing risks from unaligned artificial intelligence (AI). We believe that ambitious researchers from a variety of backgrounds have the potential to meaningfully contribute to the field of alignment research. We aim to provide the training, logistics, and community necessary to aid this transition. We also connect our fellows with financial support to ensure their stability and security.
MATS Research is an independent 501(c)(3) public charity (EIN: 99-0648563).
.webp)
MATS is an independent research and educational seminar program that connects talented researchers with top mentors in the fields of AI alignment, interpretability, and governance. The main goal of MATS is to help scholars develop as AI alignment researchers.
The MATS Program is a 12-week research fellowship designed to train and support emerging researchers working on AI alignment, interpretability, governance, and safety. Fellows collaborate with world-class mentors, receive dedicated research management support, and join a vibrant community in Berkeley focused on advancing safe and reliable AI. The program provides the structure, resources, and mentorship needed to produce impactful research and launch long-term careers in AI safety.
MATS mentors are leading researchers from a broad range of AI safety, alignment, governance, interpretability, and security domains. They include academics, industry researchers, and independent experts who guide scholars through research projects, provide feedback, and help shape each scholar’s growth as a researcher. The mentors represent expertise in areas such as:
Key dates
Application:
The main program will then run from early June to late August, with the extension phase for accepted fellows beginning in September.
MATS accepts applicants from diverse academic and professional backgrounds ranging from machine learning, mathematics, and computer science to policy, economics, physics, and cognitive science. The primary requirements are strong motivation to contribute to AI safety and evidence of technical aptitude or research potential. Prior AI safety experience is helpful but not required.
Applicants submit a general application, applying to various tracks (technical governance, empirical, policy & strategy, theory, and compute governance) and streams within those tracks.
After a centralized review period, applicants who are advanced will then undergo additional evaluations depending on the preferences of the streams they've applied to before doing final interviews and receiving offers.
For more information on how to get into MATS, please look at this page.