AI Oversight + Control
As model develop potential dangerous behaviors, can we develop and evaluate methods to monitor and regulate AI systems, ensuring they adhere to desired behaviors while minimally undermining their efficiency or performance?
Mentors
Buck Shlegeris
CEO, Redwood Research
Buck is investigating control evaluations for AI, including adversarial training to detect and prevent malicious behaviors, and exploring techniques to ensure AI safety through effective supervision and oversight.
-
Buck Shlegeris is the CEO of Redwood Research, a nonprofit organization that focuses on applied alignment research for artificial intelligence. Previously, he was a researcher at the Machine Intelligence Research Institute.
-
-
Buck and Fabien’s current main focuses are:
Developing control evaluations, which AI developers could use to make robust safety cases for their training and deployment plans.
Evaluating and improving safety techniques by using these evaluations.
More generally, finding techniques to catch or prevent malicious behavior from scheming models.
Scholars working with Buck may investigate:
Control evaluations and control interventions in new contexts. For example, doing a project like the backdooring control project but in an agentic setting, or under different assumptions, as described here.
-
Candidates for this stream should have the following skill sets:
Strong programming, preferably in ML. The core activities of these research projects is getting models to do various things and measuring how well they do them. Some of the projects involve training models directly; many of them just involve using language model APIs. It’s crucial that candidates be very fast programmers and have broad familiarity with software engineering; it’s better if they are familiar with using ML libraries to fine-tune models on GPUs etc.
Quantitative reasoning. You should be comfortable with applied math like that which comes up in this post or this post–preferably you’d be able to solve problems like those with help, but at least you should be able to understand the writeups of the solutions and be able to apply those results.
Context on strategic thinking about AI safety. Ideally you’ll be familiar with technical AI safety, and core questions about AI takeoff speeds, timelines, and proposals for reducing risks.
Buck and Fabien expect to spend about two to three hours a week on each mentee; this will probably include a weekly meeting and conversations over Slack as needed. You might also get feedback on your work from other Redwood Research staff as you’re going.
David Lindner
Research Scientist, Google DeepMind
David Lindner works on scalable oversight, AI control, and dangerous capability evaluations at Google DeepMind.
-
David Lindner is a Research Scientist on Google DeepMind's AGI Safety and Alignment team where he works on scalable oversight, AI control, and dangerous capability evaluations. Most of his work is focused on detecting or preventing deceptive alignment. You can find more details on his website.
-
My stream will focus on detecting and preventing harm from deceptive alignment (mostly scheming) via evaluations, control, and red teaming.
-
I'm interested in supervise projects that might reduce risks from scheming AI agents. Specifically, there will likely be projects in the following areas:
Red-teaming control mitigations: Design and test protocols for evaluating control techniques, i.e., techniques designed to prevent scheming models from causing harm. Projects would involve playing a blue-team/red-team game where the blue team designs control techniques and the red team tries to break them. See Greenblatt et al. 2023 for an example of this.
Evaluating scheming-ability: Clarify which capabilities are necessary for models to scheme and design evaluations to detect them. This helps to inform when we might need to implement mitigations. For examples of such evaluations, see Section 7 of Phuong et al. 2024 or Laine et al 2024.
Monitoring for deceptive reasoning: Designing and evaluating monitoring systems to detect deceptive reasoning. For example, a basic monitoring system could use a small LLM prompted to detect deceptive reasoning in the outputs of a larger LLM.
Steganographic deceptive reasoning: Monitoring fails if agents are able to hide deceptive reasoning in their outputs. Projects in this category would investigate under what circumstances models learn to do this.
Sandbagging capabilities: Under which circumstances are models able to 'sandbag', ie. fake low abilities in order to achieve a goal. Projects could involve designing demonstrations and evaluations of this phenomenon.
This list is not exhaustive, and I will likely come up with other project proposals until the program starts. But all projects will have a similar flavor to the list above.
All of my projects are fairly empirical and include running machine learning experiments daily. You should expect spending 80% or more of your time writing code and running ML experiments. For most scholars, I recommend choosing from a set of fairly concrete projects proposed by me. I'm happy for scholars with existing research experience (e.g., have successfully led ML research projects in the past) to propose their own project proposals too. Successful projects can often result in a publication, for example a workshop/conference paper or a blogpost.
-
All of my projects require being able to implement and quickly iterate on ML/LLM experiments. You are likely a good fit if you enjoy coding and prefer spending most of your time (>80%) implementing and trying out ideas.
Depending on your prior research experience, I will either provide you a concrete list of projects to choose from, or consider projects proposed by you. For most scholars, I prefer them to pick a project I have proposed. This allows you to get started on the technical work sooner and typically leads to better outcomes at the end of the program.
I strongly prefer applicants that likely want to participate in the MATS extension phase, as I think most projects need time past the main MATS research phase to be completed.
I aim to select scholars that are interested in related projects and I prefer most projects having multiple equal contributors. This makes empirical work much faster and allows you to learn from each other via pair programming. I also encourage collaborations with other streams if there are shared interests.
Erik Jenner
PhD Student, UC Berkeley / CHAI
Erik Jenner is mentored by Stuart Russell and concentrates on mechanistic anomaly detection in AI systems. His work seeks to identify when AI models generate outputs for atypical reasons.
-
Erik is a PhD student at the Center for Human-Compatible AI at UC Berkeley, advised by Stuart Russell. He primarily works on mechanistic anomaly detection. Before, he has also done work on interpretability (in particular, finding evidence of learned look-ahead in a chess-playing network) and reward learning. He has worked with around half a dozen mentees through MATS, CHAI internships, and otherwise.
-
We will use internal activations of models for runtime monitoring, i.e., catching bad behavior like a treacherous turn or trojans when it happens. Projects will be empirical, with some conceptual components (e.g., how to design meaningful proxies for scheming).
-
I am interested in runtime monitoring using model internals. This means trying to catch bad behavior by a model when it happens (e.g., backdoors/trojans or jailbreaks right now and a treacherous turn or measurement tampering in the future). I've so far mostly been working on mechanistic anomaly detection, but I am also interested in other model internals-based methods such as probing.
I'm excited about this direction because noticing bad behavior at runtime might be much easier than ensuring it doesn't happen on any possible inputs. Monitoring techniques also nicely complement methods like safety training that aim to prevent bad behavior in the first place. And while I expect black-box monitoring to be the best approach at first, I expect we will need to use model internals at some point.
Examples of potential project directions include:
Developing and testing algorithms for anomaly detection/monitoring inspired by ideas from mechanistic interpretability (e.g. SAEs, circuit discovery, or attribution patching). Can we outperform simple non-mechanistic baselines?
Red-teaming anomaly detection and other model internals methods such as probes. For example, train networks to obfuscate their internals and make model internals methods fail. How easy is this? Are some monitoring methods more robust than others?
Creating collections of tasks that let us benchmark model internals methods and compare different approaches (e.g., mechanistic anomaly detection vs coup probes). Which task parameters does performance depend on, and are there some methods that work well consistently?
Developing new monitoring methods (for example, can we combine unsupervised anomaly detection methods with coup probes, and does this outperform simply ensembling them?)
Designing and implementing games analogous to control evaluations but for model internals-based monitoring (i.e., with white-box access for the defender). Are there cases where existing white-box methods improve over black-box monitoring?
See this blog post for additional potential projects specifically on mechanistic anomaly detection.
-
Good programming (Python) and deep learning skills are important. You will spend a lot of time implementing and running deep learning experiments (e.g., training/finetuning networks and implementing monitoring techniques).
Most projects will also involve some conceptual thinking and will be on the open-ended side (rather than executing on a single very concrete idea), so you should feel comfortable with that.
Existing research experience and/or deep learning projects are a big bonus.
By default, I'll meet with you once a week for 1h, but I can do additional ad-hoc meetings (e.g., to pair on implementing/debugging occasionally). Between meetings, I encourage updates and questions on Slack and will usually respond in <24h.
I encourage collaboration with other scholars in the stream (but this will depend on stream size).
I will suggest a list of projects to pick from (roughly at the level of detail as here), and then concretize and extend the initial idea together with you. You can also pitch me your own ideas related to model internals/monitoring, but I'll have a high bar for those (feel free to check with me in advance if that's decision-relevant).
Ethan Perez
Research Scientist, Anthropic
Ethan is leading research on adversarial robustness and control of large language models at Anthropic, focusing on techniques such as red-teaming with language models and building model organisms of misalignment.
-
Ethan Perez is a researcher at Anthropic, where he leads a team working on LLM adversarial robustness and AI control. His interests span many areas of LLM safety; he's previously led work on sleeper agents, red-teaming language models with language models, developing AI safety via debate using LLMs, and demonstrating and improving unfaithfulness in chain of thought reasoning. Read more on his website.
-
This stream's mentors will be Ethan Perez, Fabien Roger, Newton Cheng, and Mrinank Sharma, with Ethan Perez being the primary mentor.
The Anthropic stream scholars always meet weekly with one or more supervisors. Each project has a designated lead supervisor (E.g. Ethan), who's expected to attend as many of the weekly meetings as possible. The lead supervisor is also the main decisionmaker on the key milestones for the project, and is the default person to go to for feedback, advice, etc. All other supervisors, either those cosupervising directly on the project or across the rest of the stream, attend weekly meetings and meet with scholars for feedback ad hoc, being more present earlier on.
Mentors are assigned within the stream based on the outcomes of our project pitching session at the start of each new round of projects.
-
Ethan’s research is focused on reducing catastrophic risks from large language models (LLMs). His research spans several areas:
Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
Developing techniques for process-based supervision, such as learning from language feedback.
Finding tasks where scaling up models result in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., sycophancy and power-seeking).
Improving the robustness of LLMs to red teaming (e.g., by red teaming with language models or pretraining with human preferences)
Investigating the risks and benefits of training predictive models over training agents, e.g., understanding the extent to which the benefits of RLHF can be obtained by predictive models, and the extent to which RLHF models can be viewed as predictive models.
Scalable oversight – the problem of supervising systems that are more capable than human overseers
Ethan’s projects involve running a large number of machine learning experiments, to gain empirical feedback on alignment techniques and failures.
-
You’re probably a great fit if you enjoy/would be good at coding, running machine learning experiments, and doing highly empirical work (spending 95% of your time doing this kind of work). You’re probably a better fit for other streams if you’re looking to do work that is heavily or primarily conceptual, theoretical, or mathematical in nature (though some projects will involve thinking through some parts of conceptual alignment, and how to test ideas there empirically). The day-to-day work is fairly fast-paced and involves a lot of writing Python scripts, using the OpenAI and Anthropic APIs to prototype out ideas with language models, etc.
I’m currently only seeking applications from people who are at least 25% likely to want to continue working together full-time post-MATS (e.g., 4-6 additional months post-MATS until the research project runs to completion).
My projects involve learning to execute well on a well-scoped research project (as a first step for getting into research). I will have several project ideas which you would be able to score your interest/fit with, and I’ll match you with a project that fits your interests. If you’re excited (or not excited) about some of my past work, that’s probably reasonably representative of whether you’d be excited about the project we’d match you with. For people who have led 2+ machine learning research projects in the past, I may be more flexible, especially where we can scope out a project that seems promising to both of us.
My projects are fairly collaborative. All projects will have 2-4 full-time research contributors (e.g., people running experiments), and generally one more hands-on research co-advisor (generally someone more experienced). I’ll provide feedback on the project primarily through weekly project meetings, one-off 1:1 meetings as needed, or random in-person check-ins/chats.
Fabien Roger
Member of Technical Staff, Anthropic
Fabien Roger is an AI safety researcher at Anthropic and previously worked at Redwood Research. Fabien’s research focuses on AI control and preventing sandbagging.
-
Fabien Roger is an AI safety researcher at Anthropic and previously worked at Redwood Research. Fabien’s research focuses on AI control and preventing sandbagging.
-
This stream's mentors will be Ethan Perez, Fabien Roger, Newton Cheng, and Mrinank Sharma, with Ethan Perez being the primary mentor.
The Anthropic stream scholars always meet weekly with one or more supervisors. Each project has a designated lead supervisor (E.g. Ethan), who's expected to attend as many of the weekly meetings as possible. The lead supervisor is also the main decisionmaker on the key milestones for the project, and is the default person to go to for feedback, advice, etc. All other supervisors, either those cosupervising directly on the project or across the rest of the stream, attend weekly meetings and meet with scholars for feedback ad hoc, being more present earlier on.
Mentors are assigned within the stream based on the outcomes of our project pitching session at the start of each new round of projects.
-
Please see Ethan's research projects.
-
See the top of this post: https://www.alignmentforum.org/posts/dZFpEdKyb9Bf4xYn7/tips-for-empirical-alignment-research
Generally someone who can run a lot of experiments quickly.
Jan Leike
Alignment Science Co-Lead, Anthropic
Jan co-leads the Alignment Science team at Anthropic. Previously, he co-led the Superalignment Team at OpenAI, where he was involved in the development of InstructGPT, ChatGPT, and the alignment of GPT-4.
-
Jan co-leads the Alignment Science team at Anthropic. Previously, he co-led the Superalignment Team at OpenAI, where he's been involved in the development of InstructGPT, ChatGPT, and the alignment of GPT-4. He developed OpenAI’s approach to alignment research and co-authored the Superalignment Team’s research roadmap. Prior to OpenAI, he was an alignment researcher at DeepMind where he prototyped reinforcement learning from human feedback. He holds a PhD in reinforcement learning theory from the Australian National University. In 2023 and 2024 TIME magazine listed him as one of the 100 most influential people in AI.
-
TBD
-
Unsupervised honesty
In this project, we aim to develop unsupervised methods for making AI models maximally honest, avoiding the possible pitfall of models becoming "human simulators" that reflect human biases rather than their best internal beliefs. The goal is to create a "truth assignment" that satisfies certain consistencies while maximizing the model's conditional probability for each claim. By iteratively improving these truth assignments, we hope to optimize model honesty without relying on supervision.
See more info here.
-
TBD
Mrinank Sharma
Member of Technical Staff, Anthropic
-
-
This stream's mentors will be Ethan Perez, Fabien Roger, Newton Cheng, and Mrinank Sharma, with Ethan Perez being the primary mentor.
The Anthropic stream scholars always meet weekly with one or more supervisors. Each project has a designated lead supervisor (E.g. Ethan), who's expected to attend as many of the weekly meetings as possible. The lead supervisor is also the main decisionmaker on the key milestones for the project, and is the default person to go to for feedback, advice, etc. All other supervisors, either those cosupervising directly on the project or across the rest of the stream, attend weekly meetings and meet with scholars for feedback ad hoc, being more present earlier on.
Mentors are assigned within the stream based on the outcomes of our project pitching session at the start of each new round of projects.
-
Ethan’s research is focused on reducing catastrophic risks from large language models (LLMs). His research spans several areas:
1. Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
2. Developing techniques for process-based supervision, such as learning from language feedback.
3. Finding tasks where scaling up models result in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., sycophancy and power-seeking).
4. Improving the robustness of LLMs to red teaming (e.g., by red teaming with language models or pretraining with human preferences)
5. Investigating the risks and benefits of training predictive models over training agents, e.g., understanding the extent to which the benefits of RLHF can be obtained by predictive models, and the extent to which RLHF models can be viewed as predictive models.
6. Scalable oversight – the problem of supervising systems that are more capable than human overseers
Ethan’s projects involve running a large number of machine learning experiments, to gain empirical feedback on alignment techniques and failures.
-
See the top of this post: https://www.alignmentforum.org/posts/dZFpEdKyb9Bf4xYn7/tips-for-empirical-alignment-research
Generally someone who can run a lot of experiments quickly.
Samuel Albanie
Staff Research Scientist, Google DeepMind
Samuel Albanie is a researcher at Google DeepMind focused on scalable oversight.
-
I am a research scientist at Google DeepMind where my current work focuses on scalable oversight. Previously, I was an Assistant Professor the University of Cambridge.
-
This stream will focus on empirically testing the effectiveness of scalable oversight protocols such as Debate (https://arxiv.org/abs/1805.00899) and Prover-Verifier Games (https://arxiv.org/abs/2407.13692).
-
AI systems will become increasingly capable of performing tasks that an unassisted human cannot do. To oversee AI execution of such tasks, several scalable oversight protocols (https://arxiv.org/abs/2211.03540) have been proposed.
To date, we have relatively limited empirical evidence demonstrating whether and where such protocols can provided a trusted oversight signal. The overarching theme of this stream will be to identify potential weaknesses in such protocols and to gather empirical evidence to assess whether these weaknesses manifest in protocol implementations with frontier LLMs.
Relevant recent work in this domain includes https://arxiv.org/abs/2211.03540, , https://arxiv.org/abs/2407.04622, https://arxiv.org/abs/2407.13692, https://arxiv.org/abs/2402.06782.
A related direction that we may explore in this stream is the design of evaluations for dangerous capabilities.
-
You are likely a good fit for this collaboration if you thrive in a scenario that involves iterating quickly with empirical experiments. This style of research typically involves many cycles of prototyping and testing in Python (often with LLM APIs to maximize speed). Consequently the following are requirements:
- Several years of ML engineering or research experience;
- The ability to own and execute a project, taking the initiative to unblock technical challenges proactively when they arise;
- The ability to get to grip with relevant research quickly (primarily by reading papers).
Project structure:
- The project will (ideally) be collaborative (with at least 2 individuals working on it full time).
- We will hold weekly meetings (ramping up if/when we approach a research output). My slack response time is typically <= 24 hours.
Newton Cheng
Research Scientist, Anthropic
Newton leads the effort on the Frontier Red Team at Anthropic to evaluate cyber misuse risks of AI models, with a focus on developing and understanding evaluations for cyber-relevant capabilities.
-
Newton is a researcher at Anthropic, where he leads the cyber misuse team on the Frontier Red Team. His interests are generally focused on threat modeling for cyber risks, developing increasing sophisticated and realistic evaluations of cyber-relevant capabilities, and advancing the science of LLM capability evaluation.
-
This stream's mentors will be Ethan Perez, Fabien Roger, Newton Cheng, and Mrinank Sharma, with Ethan Perez being the primary mentor.
The Anthropic stream scholars always meet weekly with one or more supervisors. Each project has a designated lead supervisor (E.g. Ethan), who's expected to attend as many of the weekly meetings as possible. The lead supervisor is also the main decisionmaker on the key milestones for the project, and is the default person to go to for feedback, advice, etc. All other supervisors, either those cosupervising directly on the project or across the rest of the stream, attend weekly meetings and meet with scholars for feedback ad hoc, being more present earlier on.
Mentors are assigned within the stream based on the outcomes of our project pitching session at the start of each new round of projects.
-
Ethan’s research is focused on reducing catastrophic risks from large language models (LLMs). His research spans several areas:
Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
Developing techniques for process-based supervision, such as learning from languagefeedback.
Finding tasks where scaling up models result in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., sycophancy and power-seeking).
Improving the robustness of LLMs to red teaming (e.g., by red teaming with language models or pretraining with human preferences)
Investigating the risks and benefits of training predictive models over training agents, e.g., understanding the extent to which the benefits of RLHF can be obtained by predictive models, and the extent to which RLHF models can be viewed as predictive models.
Scalable oversight – the problem of supervising systems that are more capable than human overseers
Ethan’s projects involve running a large number of machine learning experiments, to gain empirical feedback on alignment techniques and failures.
To date, we have relatively limited empirical evidence demonstrating whether and where such protocols can provided a trusted oversight signal. The overarching theme of this stream will be to identify potential weaknesses in such protocols and to gather empirical evidence to assess whether these weaknesses manifest in protocol implementations with frontier LLMs.
Relevant recent work in this domain includes https://arxiv.org/abs/2211.03540, , https://arxiv.org/abs/2407.04622, https://arxiv.org/abs/2407.13692, https://arxiv.org/abs/2402.06782.
A related direction that we may explore in this stream is the design of evaluations for dangerous capabilities.
-
I am primarily looking for scholars that have:
expertise in cybersecurity/offense, security research, or related fields (2+ years in academia, any professional experience, or significant experience in competition)
reasonable experience with LLM research and evaluation (should be comfortable running basic experiments and generally carrying out the research process)
excitement to engage with the first-order effects of their work beyond the immediate research community, e.g. implications for the broader AI ecosystem, governance, policy etc.
The most valuable quality for any scholars I supervise is a high degree of agency and self-directedness -- the field of cyber + AI is very young, and it is important that you are able to execute projects in the face of ambiguity.
Scott Emmons
Research Scientist, Google DeepMind
-
I am a research scientist at Google DeepMind focused on AI safety and alignment. I am wrapping up my PhD at UC Berkeley’s Center for Human-Compatible AI, advised by Stuart Russell. I previously cofounded far.ai, a 501(c)3 research nonprofit that incubates and accelerates beneficial AI research agendas.
-
This stream will primarily focus on (automated) red teaming and alignment evaluations. It might also include some mechanistic interpretability and game theory.
-
My project areas include: empirical projects on red teaming, alignment evals, and mechanistic interpretability; as well as theoretical projects on value alignment and deceptive AI.
Since the field moves quickly, I don’t yet know exactly which projects scholars will work on months in advance. They will likely resemble the types of projects I have helped advise, including:
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
https://arxiv.org/abs/2309.00236
A StrongREJECT for Empty Jailbreaks
https://arxiv.org/abs/2402.10260
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
https://arxiv.org/abs/2402.17747
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
-
I work on both empirical deep learning projects as well as theory projects. For the empirical projects, software engineering experience -- and in particular deep learning engineering experience -- will be useful. For the theory projects, a mathematical background, and in particular game theory, will be useful.