This stream will focus on monitoring, stress-testing safety methods, and evaluations, with an emphasis on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes etc.), chain-of-thought monitoring/faithfulness, building evaluation environments, and stress-testing mitigations.
I'm interested in detecting and mitigating deceptive alignment, mainly via capability evaluations and control, and I'm keen to supervise projects in these areas.
For each project, we will have a weekly meeting to discuss the overall project direction and prioritize next steps for the upcoming week. On a day-to-day basis, you will discuss experiments and write code with the other fellows on the project (though I'm available on Slack for quick feedback between meetings or to address anything that is blocking you).
I structure the program around collaborative, team-based research projects. You will work in a small team on a project from a predefined list. I organize the 12-week program into fast-paced research sprints designed to build and sustain research velocity, so you should expect regular deadlines and milestones. I will provide a more detailed schedule and set of milestones at the beginning of the program.
I am looking for fellows with strong machine learning engineering skills, as well as a background in technical research. While I'll provide weekly guidance on research, I expect fellows to be able to run experiments and decide on low-level details fairly independently most of the time. I'll propose concrete projects to choose from, so you should not expect to work on your own research idea during MATS. I strongly encourage collaboration within the stream, and you should expect to work in teams of 2-3 fellows on a project; good communication and teamwork skills are therefore important.
We design our stream to be highly collaborative and encourage fellows to work together, possibly with external collaborators.
We will most likely run a joint project-selection phase with the other GDM mentors, where we present a list of projects (with the option for fellows to iterate on them). Afterward, each project will have at least one main mentor, though we may also co-mentor some projects.