Anthropic

This coalition of mentors makes up the “Anthropic Stream”. This stream spans a range of empirical AI safety research areas on LLMs, including AI control, scalable oversight, model organisms, model internals, model welfare, security, and more. You’ll be pitched, and have the option to pitch, a variety of safety research projects, and then be matched to projects and mentors based on your research interests and what you’d like to get out of MATS. Fellows in this stream frequently receive funding and continued mentorship after MATS to complete their research project, usually leading to a (co-)first-author paper. Fellows from this stream often end up in long-term homes for safety research after MATS (e.g., Anthropic, Redwood Research, OpenAI).

Anthropic mentors share an application, tend to collaborate and co-mentor projects, and generally share infrastructure to streamline the fellow experience. By applying to this stream, you will be considered for all of the Anthropic mentors.

Stream overview

This stream is focused on reducing catastrophic risks from large language models (LLMs). The mentors’ research spans several areas:

  1. Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of which aspects of training are more likely to lead to deceptive alignment.
  2. Finding tasks where scaling up models results in worse behavior (inverse scaling), to understand how current training objectives actively incentivize the wrong behavior (e.g., alignment faking, sycophancy, or reward tampering); a toy inverse-scaling check is sketched after this list.
  3. Improving the robustness of LLMs to red teaming (e.g., via constitutional classifiers, red teaming with language models, pretraining with human preferences, or red teaming with best-of-n jailbreaks); a minimal best-of-n sketch appears below.
  4. Control: techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert those techniques (see Ctrl-Z).
  5. Scalable oversight: the problem of supervising systems that are more capable than their human overseers.
  6. Advancing security by investigating adversarial machine learning, cybersecurity evals, and currently possible real-world attacks.
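
To make item 2 concrete, here is a minimal, hypothetical Python sketch of an inverse-scaling check: run the same task across a series of models ordered by size and flag the task if accuracy never improves with scale. The `generate` callables are assumed stand-ins for models of increasing size; this is not Anthropic's actual evaluation code.

```python
# Hypothetical sketch of an inverse-scaling check (not a real API).
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)

def task_accuracy(generate: Callable[[str], str], dataset: Sequence[Example]) -> float:
    """Fraction of examples a single model answers correctly (exact match)."""
    correct = sum(generate(prompt).strip() == target for prompt, target in dataset)
    return correct / len(dataset)

def shows_inverse_scaling(models: Sequence[Callable[[str], str]],
                          dataset: Sequence[Example]) -> bool:
    """True if accuracy never improves as model size increases: the signature
    of a task where the training objective incentivizes the wrong behavior.

    `models` must be ordered from smallest to largest.
    """
    accs = [task_accuracy(m, dataset) for m in models]
    return all(a >= b for a, b in zip(accs, accs[1:]))
```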

These projects involve running a large number of machine learning experiments to gain empirical feedback on safety techniques and failure modes.
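
As an illustration of the kind of quick empirical loop these projects involve, here is a hedged Python sketch of the best-of-n jailbreak approach from item 3: sample n random augmentations of an attack prompt and report the first variant that elicits a harmful completion. The `generate` and `is_harmful` callables are assumed stand-ins for a target model and a harmfulness classifier, not a real API.

```python
# Hypothetical sketch of best-of-n jailbreak red teaming (not a real API).
import random
from typing import Callable, Optional

def best_of_n_attack(prompt: str,
                     generate: Callable[[str], str],
                     is_harmful: Callable[[str], bool],
                     n: int = 64,
                     seed: int = 0) -> Optional[str]:
    """Sample n random augmentations of an attack prompt and return the first
    variant that elicits a harmful completion, or None if none succeed."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = _augment(prompt, rng)
        if is_harmful(generate(candidate)):
            return candidate
    return None

def _augment(prompt: str, rng: random.Random) -> str:
    """Toy augmentation: random character casing, standing in for the
    shuffling/casing/noise perturbations used in best-of-n jailbreaking."""
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in prompt)
```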

Mentorship style

During the program, fellows meet weekly with their project mentors and collaborators. Some projects meet more often without mentors (e.g., daily standups with peers on the project). Each project has a primary mentor, who is the main decision-maker on key milestones for the project and the default person to go to for feedback and advice. Co-mentors attend project meetings as needed and provide feedback throughout the program; some co-mentors are as involved as the primary mentor.

Fellows we are looking for

See the top of this post.

Generally, we are looking for someone who can run a lot of experiments quickly.

You'll work with other fellows, co-mentors, and external collaborators.

Project selection

Mentorship starts with the “Project Pitch Session” that Anthropic runs at the start of the program. Fellows get roughly one week to de-risk and trial projects before submitting their preferences. Starting in week 2, fellows are assigned to projects, with the mentor who pitched each project serving as its primary mentor. Some projects are also assigned co-mentors: other mentors who want to join the project.