This stream will focus on monitoring, stress-testing safety methods, and evals, with an emphasis on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes etc.), chain-of-thought monitoring/faithfulness, building evaluation environments, and stress-testing mitigations.
Since the field moves quickly, it’s too early to list exact projects. But here are rough directions to give some flavor:
Previous MATS projects in our stream included:
We’ll share a list of project proposals at the beginning of the program, and scholars will be able to choose among them and form teams around them. So we encourage you to apply if a decent fraction of these directions (or the general theme) sounds exciting to you; you do not need to be excited about every single potential project.
Roland is a Research Scientist at Google DeepMind on the AGI Safety and Alignment team. He completed his Ph.D. at the University of Tuebingen / MPI-IS, working with Wieland Brendel on interpretability, robustness, and learning theory. His current work focuses on evaluations and mitigations for deceptive alignment and scheming. More generally, he is interested in understanding the behavior, capabilities, and limitations of AIs and their training procedures in order to increase trust and safety.
We are looking for scholars with strong machine learning engineering skills as well as a background in technical research. While we’ll provide weekly guidance on research, we expect scholars to run experiments and decide on low-level details fairly independently most of the time. We’ll propose concrete projects to choose from, so you should not expect to work on your own research idea during MATS. We strongly encourage collaboration within the stream, so you should expect to work on a project in a team of 2-3 scholars; good communication and teamwork skills are therefore important.