This stream will focus on monitoring, stress-testing safety methods, and evals, with an emphasis on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes etc.), chain-of-thought monitoring/faithfulness, building evaluation environments, and stress-testing mitigations.
Since the field moves quickly, it’s too early to list exact projects. But here are rough directions to give some flavor:
Previous MATS projects in our stream included:
We’ll share a list of project proposals at the beginning of the program, and scholars will be able to choose among them and form teams around them. So we encourage you to apply if a decent fraction of these directions (or the general theme) sounds exciting to you; you do not need to be excited about every single potential project.
Roland is a Research Scientist at Google DeepMind on the AGI Safety and Alignment team. He completed his Ph.D. at the University of Tuebingen / MPI-IS, working with Wieland Brendel on interpretability, robustness, and learning theory. His current work focuses on evaluations and mitigations for deceptive alignment and scheming. More generally, he is interested in understanding the behavior, capabilities, and limitations of AIs and their training procedures in order to increase trust and safety.
We are looking for scholars with strong machine learning engineering skills as well as a background in technical research. While we’ll provide weekly guidance on research, we expect scholars to run experiments and decide on low-level details fairly independently most of the time. We’ll propose concrete projects to choose from, so you should not expect to work on your own research idea during MATS. We strongly encourage collaboration within the stream, so you should expect to work on a project in a team of 2-3 scholars; good communication and teamwork skills are therefore important.