This coalition of mentors makes up the “Anthropic Stream”. The stream spans a range of empirical research areas in LLM safety, including AI control, scalable oversight, model organisms, model internals, model welfare, security, and more. You’ll be pitched, and have the option to pitch, a variety of safety research projects, and then be matched to projects and mentors based on your research interests and preferences and what you’d like to get out of MATS. Fellows in this stream frequently receive funding and continued mentorship after MATS to complete their research project, usually leading to a (co-)first author paper. People in this stream often end up in long-term homes for safety research after MATS (e.g. Anthropic, Redwood Research, OpenAI).
Anthropic mentors share an application, tend to collaborate and co-mentor projects together, and generally share infrastructure to streamline the fellow experience. By applying to this stream, you are being considered for all of the Anthropic mentors.
This stream is focused on reducing catastrophic risks from large language models (LLMs). Its research spans several areas:
Advancing security through investigating adversarial machine learning, cybersecurity evals, and understanding currently possible real-world attacks
These projects involve running a large number of machine learning experiments to gain empirical feedback on safety techniques and failures.
Ethan Perez is a researcher at Anthropic, where he leads a team working on AI control, adversarial robustness, and other areas of AI safety research. His interests span many areas of LLM safety; he has previously led work on sleeper agents, red-teaming language models with language models, developing AI safety via debate using LLMs, and demonstrating unfaithfulness in chain-of-thought reasoning and improving its faithfulness. Read more on his website.
Fabien Roger is an AI safety researcher at Anthropic and previously worked at Redwood Research. Fabien’s research focuses on AI control and dealing with alignment faking.
Hi, I'm Jack! I'm interested in understanding the cognition of modern language models, so that we can make them more reliable and aligned with human values. Currently, I lead the "Model Psych" team at Anthropic. We study the internal basis of higher-level cognitive phenomena in LLMs, like introspection, situational awareness, personas, and representations of emotion. We apply these techniques to audit Anthropic’s production models, for instance by monitoring their neural activity for signatures of deception, manipulation, or awareness of being evaluated. Previously, I did my PhD in the Center for Theoretical Neuroscience at Columbia University. For a list of my publications, see my Google Scholar profile.
Joe is a member of the Alignment Science team at Anthropic. He's currently working on scalable oversight and also has interests in control, chain-of-thought monitoring, and alignment evaluations. For some examples of recent projects, including MATS collaborations, see: https://joejbenton.com/research/.
Kyle works on model welfare at Anthropic. He previously co-founded Eleos AI Research, Telis Bioscience, and Alvea.
Nicholas is a research scientist at Google DeepMind researching adversarial machine learning; he likes to break things.
Sam Bowman leads a research group working on AI alignment and welfare at Anthropic, with a particular focus on evaluation. Sam is also on leave from NYU as an Associate Prof. of Computer Science and Data Science. He has been studying neural network language models since 2012.
Sam leads the Cognitive Oversight subteam of Anthropic's Alignment Science team. Their goal is to be able to oversee AI systems not based on whether they have good input/output behavior, but based on whether there's anything suspicious about the cognitive processes underlying those behaviors. For example, one in-scope problem is "detecting when language models are lying, including in cases where it's difficult to tell based solely on input/output". His team is interested in both white-box techniques (e.g. interpretability-based techniques) and black-box techniques (e.g. finding good ways to interrogate models about their thought processes and motivations). For more flavor on this research direction, see his post here.
Hi! I'm an independent AI alignment and safety researcher currently based out of the Bay Area. I've been working in machine learning since 2016 and made the switch to AI alignment work at the beginning of 2024 while participating in the MATS program.
My most recent work has focused on adversarial robustness of multimodal LLMs. We have been studying novel attacks that exploit the stochastic nature of LLM outputs in conjunction with their sensitivity to variations in continuous input spaces (i.e. audio or vision modalities).
I research AI safety and alignment. Most recently, I was a research scientist at Google DeepMind. I completed my PhD at UC Berkeley's Center for Human-Compatible AI, advised by Stuart Russell. I previously cofounded FAR.AI, a 501(c)3 research nonprofit that incubates and accelerates beneficial AI research agendas.
I develop AI alignment frameworks, stress-test their limits, and turn insights into methodology adopted across the field. I have established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment, designed practical metrics to preserve monitorability during model development, shown that obfuscated activations can bypass latent-space defenses, and developed StrongREJECT, a jailbreak benchmark now used by OpenAI, US/UK AISI, Amazon, and others.
Stephen is currently a researcher at OpenAI, where he researches how to align and control superintelligence. He was previously a postdoc at CMU working with Tuomas Sandholm. Stephen received his PhD in computer science from the University of California, Irvine, working with Pierre Baldi. During his PhD, he did research scientist internships at Intel Labs and DeepMind. Before that, Stephen received his bachelor's degree in mathematics and economics from Arizona State University in 2017. Projects he is interested in include:
I'm a Member of Technical Staff on the Alignment Science team at Anthropic. I'm currently enabling Claude to automatically audit and detect misalignment.
During the program, fellows meet weekly with their project mentors and collaborators. Some projects meet more often without mentors (e.g., daily standups with peers on the project). Each project has a primary mentor, who is the main decision-maker on key project milestones and the default person to go to for feedback and advice. Co-mentors also attend project meetings as needed and provide feedback throughout the program. Some co-mentors are as involved as the primary mentor.
You'll work with other fellows, co-mentors, and external collaborators.
Mentorship starts with the “Project Pitch Session” that Anthropic runs at the start of the program. Fellows get ~1 week to derisk and trial projects before submitting their preferences. Starting in week 2, fellows are assigned to projects, and each project's primary mentor is whoever pitched it. Some projects are also assigned co-mentors: other supervisors who want to join the project.