Arthur Conmy

Arthur Conmy's MATS Stream focuses on evaluating interpretability techniques on current and future AI Safety problems.

This can involve developing new safety techniques, as well as building benchmarks and measuring performance against baselines.

Stream overview

I am broadly interested in research directions that scholars are excited about and that can advance the quality of our AI Safety tools, and our confidence in them. Three areas of research that seem particularly promising to me are:

  1. Reasoning Model Interpretability
  2. Eliciting Strange Behaviors from Models 'in the wild' (i.e. not by training model organisms)
  3. Model Organisms too :)

Mentors

Arthur Conmy
Google DeepMind, Senior Research Engineer
London
Interpretability
Red-Teaming
Monitoring

Arthur Conmy is a Research Engineer at Google DeepMind, on the Language Model Interpretability team with Neel Nanda.

Arthur's focus is on practically useful interpretability and related AI Safety research. For example, Arthur was one of the core engineers who first added probes to Gemini deployments. He has also recently led research on how to interpret reasoning models (https://arxiv.org/abs/2506.19143) and how to elicit knowledge from model organisms (https://arxiv.org/abs/2510.01070), both through MATS.

In the past, Arthur did early, influential work on automating interpretability and finding circuits. Before joining Google DeepMind, he worked at Redwood Research.


Mentorship style

I meet for one hour per week in scheduled group meetings.

I also fairly frequently schedule ad hoc meetings with scholars to check on how they're doing and to address issues or opportunities that aren't directly related to the project.

I'll help with research obstacles, including outside of meetings.

Representative papers

My MATS Winter 2025 Paper – not as interpretability-focused as the others, but we do use probing (see the minimal probe sketch after this list), and it was an important paper for increasing confidence about claims of unfaithfulness: we can find unfaithfulness in the wild, but the rates of unfaithfulness are low

Thought Anchors – early reasoning-model interpretability work I co-supervised with Neel, Paul, and Uzay (MATS Summer 2025)

Benchmarking Interpretability – MATS Summer 2024 work evaluating Sparse Autoencoders (SAEs), one of several lines of evidence we used at DeepMind to deprioritise SAEs

Eliciting Secret Knowledge from Language Models – work from MATS Summer 2025 scholar Bartosz that uses model organisms to evaluate interpretability tools
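Since several of the papers above lean on probing, here is a minimal, illustrative sketch of what a linear probe on frozen model activations looks like. This is not the code from any of these papers: the activations and labels below are random placeholders standing in for residual-stream activations and a binary property of interest (e.g. whether an answer was faithful), and scikit-learn's logistic regression is just one common choice of probe.

```python
# Minimal, illustrative linear-probe sketch (not the code from the papers above).
# `acts` and `labels` are random placeholders; in practice they would be
# frozen model activations (n_examples x d_model) and labels for the
# property being probed for.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))      # placeholder activations
labels = rng.integers(0, 2, size=1000)   # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

# A linear probe is just a linear classifier trained on frozen activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print(f"Probe test accuracy: {probe.score(X_test, y_test):.2f}")
```

With random placeholder data the accuracy will hover around chance; the interesting questions in the papers above are about what held-out accuracy on real activations does and does not tell you.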

Scholars we are looking for

Executing fast on projects is highly important. Having a good sense of which next steps are correct is also valuable, though since I enjoy being pretty involved in projects, it's somewhat easier for me to steer a project than to teach you how to execute fast from scratch. It also helps to be motivated to make interpretability useful and to apply it to AI Safety.

I will also be interviewing folks doing Neel Nanda's MATS research sprint whom Neel doesn't get to work with.

I think collaborations make projects stronger, so I would try to pair you with, for example, some of my MATS extension scholars, or, most likely, other scholars I take on this round.

Project selection

Mentor(s) will talk through project ideas with scholars.