Anthropic

This coalition of mentors makes up the “Anthropic Stream”. This stream spans a range of empirical research areas in AI safety on LLMs, including AI control, scalable oversight, model organisms, model internals, model welfare, security, and more. You’ll be pitched, and have the option to pitch, a variety of safety research projects, and then be matched to projects and mentors based on your research interests and preferences and what you’d like to get out of MATS. Fellows in this stream frequently receive funding and continued mentorship after MATS to complete their research project, usually leading to a (co-)first author paper. People in this stream often end up in long-term homes for safety research after MATS (e.g. Anthropic, Redwood Research, OpenAI).

Anthropic mentors share an application, tend to collaborate and co-mentor projects together, and generally share infrastructure to streamline the fellow experience. By applying to this stream, you are being considered for all of the Anthropic mentors.

Stream overview

This stream is focused on reducing catastrophic risks from large language models (LLMs). The mentors' research spans several areas:

  1. Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
  2. Finding tasks where scaling up models results in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., alignment faking, sycophancy, or reward tampering).
  3. Improving the robustness of LLMs to red teaming (e.g., via constitutional classifiers, red teaming with language models, pretraining with human preferences, or red teaming with best-of-n jailbreaks).
  4. Control - techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert the techniques (see Ctrl-Z).
  5. Scalable oversight – the problem of supervising systems that are more capable than human overseers.
  6. Advancing security through investigating adversarial machine learning, cybersecurity evals, and understanding currently possible real-world attacks.

These projects involve running a large number of machine learning experiments, to gain empirical feedback on safety techniques and failures.

Mentors

Ethan Perez
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Ethan Perez is a researcher at Anthropic, where he leads a team working on AI control, adversarial robustness, and other areas of AI safety research. His interests span many areas of LLM safety; he's previously led work on sleeper agents, red-teaming language models with language models, developing AI safety via debate using LLMs, and demonstrating and improving unfaithfulness in chain-of-thought reasoning. Read more on his website.

Fabien Roger
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Fabien Roger is an AI safety researcher at Anthropic and previously worked at Redwood Research. Fabien’s research focuses on AI control and dealing with alignment faking.

Jack Lindsey
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Hi, I'm Jack! I'm interested in understanding the cognition of modern language models, so that we can make them more reliable and aligned with human values. Currently, I lead the "Model Psych" team at Anthropic. We study the internal basis of higher-level cognitive phenomena in LLMs, like introspection, situational awareness, personas, and representations of emotion. We apply these techniques to audit Anthropic’s production models, for instance by monitoring their neural activity for signatures of deception, manipulation, or awareness of being evaluated. Previously, I did my PhD in the Center for Theoretical Neuroscience at Columbia University. For a list of my publications, see my Google Scholar profile.

Joe Benton
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Joe is a member of the Alignment Science team at Anthropic. He's currently working on scalable oversight and also has interests in control, chain-of-thought monitoring, and alignment evaluations. For some examples of recent projects, including MATS collaborations, see: https://joejbenton.com/research/.

Kyle Fish
Anthropic, Model Welfare Lead
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Kyle works on model welfare at Anthropic. He previously co-founded Eleos AI Research, Telis Bioscience, and Alvea. 

Nicholas Carlini
Anthropic, Research Scientist
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Nicholas is a research scientist at Google DeepMind researching adversarial machine learning; he likes to break things.

Sam Bowman
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Sam Bowman leads a research group working on AI alignment and welfare at Anthropic, with a particular focus on evaluation. Sam is also on leave from NYU as an Associate Prof. of Computer Science and Data Science. He has been studying neural network language models since 2012.

Samuel Marks
Anthropic, Member of Technical Staff
Boston
Control
Model Organisms
Red-Teaming
Scheming and Deception

Sam leads the Cognitive Oversight subteam of Anthropic's Alignment Science team. Their goal is to be able to oversee AI systems not based on whether they have good input/output behavior, but based on whether there's anything suspicious about the cognitive processes underlying those behaviors. For example, one in-scope problem is "detecting when language models are lying, including in cases where it's difficult to tell based solely on input/output". His team is interested in both white-box techniques (e.g. interpretability-based techniques) and black-box techniques (e.g. finding good ways to interrogate models about their thought processes and motivations). For more flavor on this research direction, see his post here.

Sara Price
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Hi! I'm an independent AI alignment and safety researcher currently based out of the Bay Area. I've been working in machine learning since 2016 and made the switch to AI alignment work at the beginning of 2024 while participating in the MATS program.

My most recent work has focused on adversarial robustness of multimodal LLMs. We have been studying novel attacks that exploit the stochastic nature of LLM outputs in conjunction with their sensitivity to variations in continuous input spaces (i.e. audio or vision modalities).

Scott Emmons
Anthropic, Research Scientist
SF Bay Area
Control
Red-Teaming
Monitoring
Model Organisms
Safeguards

I research AI safety and alignment. Most recently, I was a research scientist at Google DeepMind. I completed my PhD at UC Berkeley's Center for Human-Compatible AI, advised by Stuart Russell. I previously cofounded FAR.AI, a 501(c)(3) research nonprofit that incubates and accelerates beneficial AI research agendas.

I develop AI alignment frameworks, stress-test their limits, and turn insights into methodology adopted across the field. I have established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment, designed practical metrics to preserve monitorability during model development, shown that obfuscated activations can bypass latent-space defenses, and developed StrongREJECT, a jailbreak benchmark now used by OpenAI, US/UK AISI, Amazon, and others.

Stephen McAleer
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

Stephen is currently a researcher at OpenAI where he researches how to align and control superintelligence. He was previously a postdoc at CMU working with Tuomas Sandholm. Stephen received his PhD in computer science from the University of California, Irvine working with Pierre Baldi. During his PhD, he did research scientist internships at Intel Labs and DeepMind. Before that, Stephen received his bachelor's degree in mathematics and economics from Arizona State University in 2017. Projects he is interested in include:

  • Anything related to control/monitoring for coding agents
  • Scalable oversight for agent alignment
  • Scheming evaluations and mitigations
  • Adversarial training for robust monitors / reward models
  • Reward hacking / deception in agents
Trenton Bricken
Anthropic, Member of Technical Staff
SF Bay Area
Control
Model Organisms
Red-Teaming
Scheming and Deception

I'm a Member of Technical Staff on the Alignment Science team at Anthropic. I'm currently enabling Claude to automatically audit and detect misalignment.

About me

  • I have a PhD in Systems Biology from Harvard. My thesis was on "Sparse Representations in Biological and Artificial Neural Networks" in the Kreiman Lab with support from the NSF Graduate Research Fellowship. I also spent time at the Berkeley Redwood Center for Theoretical Neuroscience as a visiting researcher.
  • I graduated from Duke University in May 2020 with a self-made major in "Minds and Machines: Biological and Artificial Intelligence". I was lucky to attend as a Robertson Scholar, which provided full funding during all four years, including summer experiences.
  • At Duke, I spent a year doing research in Dr. Michael Lynch's Lab attempting to use machine learning to design new CRISPR guide RNAs for safer, more effective genome editing. Afterwards, I was affiliated with Dr. Debora Marks's Lab at Harvard Medical School applying deep learning to protein design. I also contributed to the IARPA Fun GCAT and DARPA Biostasis programs.


Mentorship style

During the program, scholars meet weekly with their project mentors and collaborators. Some projects meet more often without mentors (e.g., daily standups with the peers on the project). Each project will have a primary mentor, who is also the main decision-maker on key milestones for the project and who is the default person to go to for feedback, advice, etc. Co-mentors also attend project meetings as needed and provide feedback throughout the program. Some project co-mentors can be as involved as the primary mentor.

Fellows we are looking for

See the top of this post

Generally someone who can run a lot of experiments quickly.

You'll work with other scholars, co-mentors, and external collaborators.

Project selection

Mentorship starts with the “Project Pitch Session” Anthropic runs at the start of the program. Fellows get ~1 week to derisk and trial projects before submitting their preferences. Starting in week 2, scholars are assigned to projects, and the primary mentor of each project is whoever pitched it. Some projects are also assigned co-mentors: other supervisors who want to join the project.