AI Interpretability
Rigorously understanding how ML models function may allow us to identify and train against misalignment. Can we reverse engineer neural nets from their weights, or identify structures corresponding to “goals” or dangerous capabilities within a model and surgically alter them?
Mentors
Neel leads the mechanistic interpretability team at Google DeepMind, focusing on reverse-engineering the algorithms learned by neural networks to differentiate between helpful and deceptively aligned models and better understand language model cognition.
-
Neel leads the Google DeepMind mechanistic interpretability team. He previously worked on mechanistic interpretability at Anthropic on the transformer circuits agenda and as an independent researcher on reverse-engineering grokking and making better tooling and educational materials for the field.
-
This stream's mentors are Neel Nanda and Arthur Conmy, with Neel Nanda being the primary mentor.
-
When training an ML model, we may know that it will learn an algorithm with good performance, but it can be very hard to tell which one. This is particularly concerning when "be genuinely helpful and aligned" and "deceive your operators by acting helpful and aligned, until you can decisively act to take power" look behaviorally similar. Mechanistic interpretability is the study of taking a trained neural network and analysing its weights to reverse engineer the algorithms it has learned. In contrast to more conventional approaches to interpretability, it focuses on understanding model internals and what they represent (the model's "cognition"), and puts a high premium on deep and rigorous understanding, even if this means answering very narrow and specific questions. Better interpretability tools seem useful in many ways for alignment, but mechanistic approaches in particular may let us better distinguish deceptive models from aligned ones.
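To give a concrete flavour of what "looking at model internals" involves, here is a minimal sketch using the open-source TransformerLens library; the model choice, prompt, and hook names are illustrative, not part of any particular research agenda.

```python
# Minimal sketch: load a small model and cache its internal activations.
# Hook names follow TransformerLens's naming for GPT-2-style models.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# Attention patterns of layer 9: shape [batch, head, query_pos, key_pos]
print(cache["blocks.9.attn.hook_pattern"].shape)

# Residual stream after layer 9: shape [batch, seq_pos, d_model]
print(cache["blocks.9.hook_resid_post"].shape)
```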
-
What am I looking for in an application?
Some core skills in mech interp that I’ll be looking for signs of potential in - I’m excited both about candidates who are OK at all of them and about candidates who really shine at one:
Empirical Truth-Seeking: The core goal of interpretability is to form true beliefs about models. The main way to do this is by running experiments, visualising the results, and understanding their implications for what is true about the model.
You can show this with transparent reasoning about what you believe to be true, and nuanced arguments for why.
Practicality: A willingness to get your hands dirty - writing code, running experiments, and playing with models. A common mistake in people new to the field is too great a focus on reading papers and thinking about things, rather than just doing stuff.
You can demonstrate this by just having a bunch of experiments that show interesting things!
Scepticism: Interpretability is hard and it is easy to trick yourself. A healthy scepticism must be applied to your results and how you interpret them, and often a well designed experiment can confirm or disprove your assumptions. It’s important to not overclaim!
Applications that make a small yet rigorous claim are much better than ones that make fascinating yet bold and wildly overconfident claims
You can show this with clear discussion of the limitations of your evidence and alternative hypotheses
Agency & Creativity: Being willing to just try a bunch of stuff, generate interesting experiment ideas, and be able to get yourself unstuck
You can show this if I read your application and think "wow, I didn't think of that experiment, but it's a good idea"
Intuitive reasoning: It helps a lot to have some intuitions for models - what they can and cannot do, and how they might implement a given behaviour
You can show this by discussing your hypotheses going into the investigation, and the motivation behind your experiments. Though "I just tried a bunch of shit to see what happened, and interpreted after the fact" is also a perfectly fine strategy
Enthusiasm & Curiosity: Mech interp can be hard, confusing and frustrating, or it can be fascinating, exciting and tantalising. How you feel about it is a big input here, to how good at the research you are and how much fun you have. A core research skill is following your curiosity (and learning the research taste to be curious about productive things!)
I know this is easy to fake and hard to judge from an application, so I don’t weight it highly here
I also want candidates who are able to present and explain their findings and thoughts clearly.
I’m aware that applicants will have very different levels of prior knowledge of mech interp, and will try to control for this.
A fuzzy and hard-to-define criterion is shared research taste - I want to mentor scholars who are excited about the same kinds of research questions that I am! I recommend against trying to optimise for this, but mention it because I want to be transparent about it being a factor.
What background do applicants need?
You don’t need prior knowledge of or research experience in mech interp, nor experience with ML, maths, or research in general. Though all of these help and are a plus!
I outline important pre-requisites here - you can learn some of these on the go, but each helps, especially a background in linear algebra, and experience coding.
In particular, a common misconception is that you need to be a strong mathematician - it certainly helps, but I’ve accepted scholars with weak maths backgrounds who’ve picked up enough to get by.
Mech interp is a sufficiently young field that it just doesn’t take that long to learn enough to do original and useful research, especially with me to tell you what to prioritise!
How can I tell if I’d be a good fit for this?
If you think you have the skills detailed above, that’s a very good sign!
More generally, finding the idea of mech interp exciting, and being very curious about the idea of what’s inside models - if you’ve read a mech interp paper and thought “this is awesome and I want to learn more about it” that’s a good sign!
The training phase of the program is fairly competitive, which some people find very stressful. My impression is that participants are generally nice and cooperative, especially since you want to form teams, but ultimately fewer than half will make it through to the research phase, which sucks.
Adrià Garriga-Alonso
Research Scientist, FAR AI
Adrià is a Research Scientist at FAR AI focusing on advancing neural network interpretability and developing rigorous ways to measure progress in the field.
-
Adrià is a Research Scientist at FAR AI, where he focuses on detecting neural network planning through model internals and measuring progress in mechanistic interpretability. His previous interpretability work includes Automatic Circuit Discovery and Causal Scrubbing. He previously worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks with Prof. Carl Rasmussen.
-
The goal of this stream is to come up with algorithms that detect and prevent scheming in a neural network (NN), by: 1) detecting multiple plans that the NN is considering and 2) finding what objective the NN uses to select the plan it executes. We make progress with probes and mech interp tools on model organisms (e.g. https://arxiv.org/abs/2407.15421). We also work on improving mech interp tools and on measuring whether those tools work.
-
The main goal of these research projects is to understand how (and when) NNs learn to plan. The theory of change is to be able to focus the planning of NNs solely into well-understood outlets (e.g. scratchpads). The long-term goal is to have a 'probing' algorithm that, applied to an NN, yields its inner reward or finds that there isn't one.
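As a minimal illustration of the probing approach, the sketch below fits a logistic-regression probe on cached activations to predict a binary property of the plan an agent ends up executing. The arrays are random stand-ins; in practice they would come from a model organism such as the Sokoban agent.

```python
# Minimal linear-probe sketch: given activations from a model organism at some
# layer/position and binary labels describing a property of the executed plan,
# fit a logistic-regression probe and check held-out accuracy.
# The arrays below are random stand-ins for real cached activations and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))        # stand-in for [n_examples, d_model] activations
labels = rng.integers(0, 2, size=2000)     # stand-in for e.g. "does the plan go left first?"

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```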
Example possible projects for scholars this round:
Probe for alternative plans in the LeelaChess model (see https://arxiv.org/abs/2406.00877)
Find out more about how one plan is chosen over others in the Sokoban model organism (see https://arxiv.org/abs/2407.15421)
Come up with a more realistic (LLM ish) model organism for planning
Measure the validity of sparse autoencoder (SAE) features better (one simple check is sketched below). Try to find simple descriptors that work very well for low-level features, and describe the next layer's features as simple combinations of the previous layer's.
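As a hedged illustration of the SAE-validity theme (not the descriptor-based measurement itself), one common sanity check is to splice the SAE's reconstruction back into the forward pass and see how much the language-modelling loss degrades. The sketch below uses TransformerLens-style hooks with an untrained toy SAE purely to show the mechanics; the hook site and SAE architecture are placeholders.

```python
# Sketch of one standard SAE sanity check: substitute the SAE's reconstruction
# of an activation back into the forward pass and measure how much the
# language-modelling loss degrades. The SAE here is untrained and only
# demonstrates the mechanics.
import torch
import torch.nn as nn
from transformer_lens import HookedTransformer

class TinySAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_post"          # illustrative hook site
sae = TinySAE(model.cfg.d_model, 8 * model.cfg.d_model)

tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
clean_loss = model(tokens, return_type="loss")
patched_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(hook_name, lambda act, hook: sae(act))],  # replace activation with reconstruction
)
print(f"clean loss {clean_loss.item():.3f}, with SAE reconstruction {patched_loss.item():.3f}")
```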
Scholars in this stream in MATS 2024 worked on:
A benchmark with ground-truth circuits for evaluating interpretability hypotheses and automatic discovery methods
A benchmark for evaluating sparse autoencoders (SAEs) as a tool to retrieve the "true" features a model uses to compute
Stronger ways of evaluating interpretability hypotheses (i.e. adversarial patching)
Replicating existing interpretability work in MAMBA.
Probing for plans in the Sokoban-planning model organism
Arthur Conmy
Research Engineer, Google DeepMind
Arthur’s research focuses on discovering and developing new methods for automating interpretability and on applying model internals to critical safety tasks.
-
Arthur Conmy is a Research Engineer at Google DeepMind, on the Language Model Interpretability team with Neel Nanda. His interests are in automating interpretability, finding circuits and making model internals techniques useful for AI Safety. Previously, he worked at Redwood Research (and did the MATS Program!).
-
This stream's mentors are Neel Nanda and Arthur Conmy, with Neel Nanda being the primary mentor.
-
I’m most interested in supervising projects that propose original ways to scale interpretability, and/or show that model internals techniques are helpful for safety-relevant tasks (e.g. jailbreaks, sycophancy, hallucinations). I work with Sparse Autoencoders a lot currently, but it’s not a requirement to work on these if you work with me.
-
I’m quite empirically-minded and prefer discussing experiments and implementation to theory or philosophy. I think this leads to stronger outputs from projects, but if you would prefer more theoretical research other mentors may be a better fit. We would likely meet once a week for over an hour, and then more frequently when you have a blog post or paper to put out.
Lee Sharkey
Chief Strategy Officer, Apollo Research
Lee is Chief Strategy Officer at Apollo Research. His main research interests are mechanistic interpretability and “inner alignment.”
-
Lee Sharkey is Chief Strategy Officer at Apollo Research. Previously, Lee was a Research Engineer at Conjecture, where he recently published an interim report on superposition. His main research interests are mechanistic interpretability and “inner alignment.” Lee’s past research includes “Goal Misgeneralization in Deep Reinforcement Learning” and “Circumventing interpretability: How to defeat mind-readers.”
-
Lee's stream will focus primarily on improving mechanistic interpretability methods (sometimes known as 'fundamental interpretability' research).
-
Understanding what AI systems are thinking seems important for ensuring their safety, especially as they become more capable. For some dangerous capabilities like deception, it’s likely one of the few ways we can get safety assurances in high stakes settings.
Safety motivations aside, capable models are fundamentally extremely interesting objects of study and doing digital neuroscience on them is comparatively much easier than studying biological neural systems. Also, being a relatively new subfield, there is a tonne of low hanging fruit in interpretability ripe for the picking.
I think mentorship works best when the mentee is driven to pursue their project; this often (but not always) means they have chosen their own research direction. As part of the application to this stream, I ask prospective mentees to write a project proposal, which forms the basis of part of the selection process. If chosen, depending on the research project, other Apollo Research staff may offer mentorship support.
What kinds of research projects am I interested in mentoring?
Until recently, I have primarily been interested in 'fundamental' interpretability research. But with recent progress, particularly from Cunningham et al. (2023), Bricken et al. (2023), and other upcoming work (including from other scholars in previous cohorts!), I think enough fundamental progress has been made that I'm now equally open to supervising applied interpretability work on networks of practical importance, particularly work that uses sparse dictionary learning as a basic interpretability method.
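For applicants newer to the area, here is a minimal sketch of what sparse dictionary learning on model activations amounts to in practice: a toy sparse autoencoder trained with an L1 sparsity penalty. The hyperparameters and the random "activations" are placeholders; real SAE training involves many further details (e.g. handling dead features, normalisation).

```python
# Toy sparse autoencoder (sparse dictionary learning) on cached activations.
# All hyperparameters and the data source are placeholders.
import torch
import torch.nn as nn

d_model, d_dict, l1_coeff = 512, 4096, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(feats)           # reconstruction of the input
        return recon, feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(10_000, d_model)    # stand-in for cached model activations

for step in range(1_000):
    batch = activations[torch.randint(0, len(activations), (256,))]
    recon, feats = sae(batch)
    # Reconstruction error plus L1 penalty encouraging sparse feature usage
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```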
Here is a list of example project ideas that I'm interested in supervising, which span applied, fundamental, and philosophical interpretability questions. These project ideas are only examples (though I'd be excited if mentees were to choose one of them). If your interpretability project ideas are not in this list, there is still a very good chance I am interested in supervising them:
Examples of applied interpretability questions I'm interested in:
What do the sparse dictionary features mean in audio or other multimodal models? Can we find some of the first examples of circuits in audio/other multimodal models? (see Reid (2023) for some initial work in this direction)
Apply sparse dictionary learning to a vision network, potentially a convolutional network such as AlexNet or Inceptionv1, thus helping to complete the project initiated by the Distill thread that worked toward completely understanding one seminal network in very fine detail.
Can we automate the discovery of "finite state automata"-like assemblies of features, which partly describe the computational processes implemented in transformers, as introduced in Bricken et al. (2023)?
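As one rough, hedged starting point for automating the last question, one could build a transition graph over SAE features: count how often feature j fires at the token position after feature i fires, and look for small, strongly connected groups. Everything below (the activation matrix, the thresholds) is a random placeholder.

```python
# Rough sketch: build a feature->feature transition count matrix from a
# [n_tokens, n_features] matrix of SAE feature activations over a long text,
# then keep strong edges as candidate "finite state automata"-like assemblies.
# The activation matrix below is a random, sparse stand-in.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_features = 10_000, 1024
feature_acts = rng.random((n_tokens, n_features)) * (rng.random((n_tokens, n_features)) < 0.02)
active = feature_acts > 0.0                    # which features fire at each token

# transitions[i, j] = number of positions where feature i fires at t and feature j at t+1
transitions = active[:-1].T.astype(np.int64) @ active[1:].astype(np.int64)

# Normalise rows into empirical transition frequencies and keep strong edges
row_totals = transitions.sum(axis=1, keepdims=True).clip(min=1)
probs = transitions / row_totals
strong_edges = np.argwhere(probs > 0.5)        # arbitrary threshold
print(f"{len(strong_edges)} strong feature->feature transitions")
```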
Examples of fundamental questions I'm interested in:
How do we ensure that sparse dictionary features actually are used by the network, rather than simply being recoverable by sparse dictionary learning? In other words, how can we identify whether sparse dictionary features are functionally relevant?
Gated Linear Units (GLUs) (Shazeer, 2020), such as SwiGLU layers or bilinear layers, are a kind of MLP used in many public models (e.g. Llama2, which uses SwiGLU MLPs) and likely in non-public frontier models (such as PaLM2, which also uses SwiGLU). How do they transform sparse dictionary elements? Bilinear layers are an instance of GLUs that have an analytical expression, which makes them attractive candidates for studying how sparse dictionary elements are transformed in the nonlinear computations of GLUs. (A minimal sketch of these MLP variants follows this list.)
Furthermore, there exists an analytical expression for transformers that use bilinear MLPs (with no layer norm) (Sharkey, 2023). Can we train a small bilinear transformer on either a toy or real dataset, perform sparse dictionary learning on its activations, and understand the role of each sparse dictionary feature in terms of the closed-form solution? This may help in identifying fundamental structures within transformers, similarly to how induction heads were discovered.
RMSNorm is a competitive method of normalizing activations in neural networks. It is also more intuitive to understand than layer norm. Studying toy models that use it (such as transformers that use only attention and RMSNorm as their nonlinearities) seems like a good first step to understanding its role in larger models that use it (such as Llama2). What tasks can such toy transformers solve and how do they achieve it?
I'm also open to supervising singular learning theory (SLT)-related projects but claim no expertise in SLT. Signing up with me for such projects would be high risk. So I’m slightly less likely to supervise you if you propose to do such a project, unless the project feels within reach for me. I'd be open to exploring options for a relatively hands off mentorship if a potential mentee was interested in pursuing such a project and couldn't find a more suitable mentor.
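To make the GLU question above more concrete, here is a minimal sketch of the standard bilinear and SwiGLU MLP variants (dimensions are placeholders and biases are omitted for simplicity); this is just the published formulation, not a claim about any particular frontier model's internals.

```python
# Minimal sketch of the MLP variants discussed in the GLU item above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearMLP(nn.Module):
    """Bilinear GLU: out = W_down((W_a x) * (W_b x)). No elementwise nonlinearity,
    so the layer is a quadratic form in x with a closed-form expression."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_a = nn.Linear(d_model, d_hidden, bias=False)
        self.w_b = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(self.w_a(x) * self.w_b(x))

class SwiGLUMLP(nn.Module):
    """SwiGLU (Llama-style): out = W_down(silu(W_gate x) * (W_up x))."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```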
Examples of philosophical interpretability questions I'm interested in:
What is a feature? What terms should we really be using here? What assumptions do these concepts make and where does it lead when we take these assumptions to their natural conclusions? What is the relationship between the network’s ontology, the data-generating ontology, sparse dictionary learning, and superposition?
Again, these project ideas are only examples. I’m interested in supervising a broad range of projects and encourage applicants to devise their own if they are inclined. (I predict that devising your own will have a neutral effect on your chances of acceptance in expectation: It will positively affect your chances in that I’m most excited by individuals who can generate good research directions and carry them out independently. But it will negatively affect your chances in that I expect most people are worse than I am at devising research directions that I in particular am interested in! Overall, I think the balance is probably neutral.)
-
As an indicative guide (this is not a score sheet), in no particular order, I evaluate candidates according to:
Science background
What indicators are there that the candidate can think scientifically, can run their own research projects or help others effectively in theirs?
Quantitative skills
How likely is the candidate to have a good grasp of mathematics that is relevant for interpretability research?
Note that this may include advanced math topics that have not yet been widely used in interpretability research but have potential to be.
Engineering skills
How strong is the candidate’s engineering background? Have they worked enough with Python and the relevant libraries (e.g. PyTorch, scikit-learn)?
Other interpretability prerequisites
How likely is the candidate to have a good grasp of the content gestured at in Neel Nanda’s list of interpretability prerequisites?
Safety research interests
How deeply has the candidate engaged with the relevant AI safety literature? Are the research directions that they’ve landed on consistent with a well developed theory of impact?
Conscientiousness
In interpretability, as in art, projects are never finished, only abandoned. How likely is it that the candidate will have enough conscientiousness to bring a project to a completed-enough state?
In the past cohort I chose a diversity of candidates with varying strengths and I think this worked quite well. Some mentees were outstanding in particular dimensions, others were great all rounders.
Mentorship looks like a one-hour weekly meeting by default, with Slack messages in between. Usually these meetings are just for updates about how the project is going, where I’ll provide some input and light steering if necessary and desired. If there are urgent bottlenecks, I’m more than happy to meet between the weekly meetings or respond on Slack in (almost always) less than 24 hours. In some cases, projects might be of a nature that they’ll work well as a collaboration with external researchers; in these cases, I’ll usually encourage collaborations. For instance, in the previous round of MATS a collaboration that was organized through the EleutherAI interpretability channel worked quite well, culminating in Cunningham et al. (2023). With regard to write-ups, I’m usually happy to invest substantial time giving input or detailed feedback on things that will go public.
Nina Panickssery
Member of Technical Staff, Anthropic
Open to a variety of projects in LLM interpretability and adversarial robustness.
-
Nina is a researcher at Anthropic, where she works on pretraining data research and some alignment projects. Before joining Anthropic, she worked as a software engineer at Stripe and participated in MATS summer 2023.
-
Open to a variety of projects in LLM interpretability and adversarial robustness, particularly practical applications of concept-based interpretability methods.
-
Some research directions I am currently interested in:
Ways that we can leverage concept-based interpretability methods to better steer and oversee LLMs after deployment (a minimal steering sketch appears at the end of this section)
Mitigations for persona drift and long-context jailbreaks
Understanding how post-training techniques affect model representations and mechanisms
Machine unlearning and red-teaming of unlearning techniques
Improving existing feature-extraction and probing methods
Happy to discuss any ideas mentees have; I'm flexible about the choice of project.
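As an illustration of the first direction above, here is a minimal sketch of difference-of-means activation steering: build a "concept vector" from contrasting prompts and add it to the residual stream during generation. The model, layer index, scale, and prompts are illustrative assumptions, not a description of any specific production method.

```python
# Minimal sketch of concept-based activation steering: average residual-stream
# activations over contrasting prompts, take the difference as a concept vector,
# then add it into the residual stream while generating.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name, layer, scale = "gpt2", 6, 4.0       # illustrative choices
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
block = model.transformer.h[layer]               # GPT-2-style block; other models differ

@torch.no_grad()
def mean_residual(prompts):
    means = []
    for p in prompts:
        captured = {}
        def capture(module, inputs, output):
            captured["resid"] = output[0]        # [batch, seq, d_model]; output left unchanged
        h = block.register_forward_hook(capture)
        model(**tok(p, return_tensors="pt"))
        h.remove()
        means.append(captured["resid"][0].mean(dim=0))  # average over token positions
    return torch.stack(means).mean(dim=0)

# Toy contrastive prompt sets standing in for a concept
concept_vec = mean_residual(["I love this, it's wonderful."]) - mean_residual(["I hate this, it's awful."])

def steer(module, inputs, output):
    # Add the concept vector to the block's hidden states at every forward pass
    return (output[0] + scale * concept_vec,) + output[1:]

handle = block.register_forward_hook(steer)
out = model.generate(**tok("I think the movie was", return_tensors="pt"), max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```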
-
I'm looking for research scholars who are experienced in Python and PyTorch and have some language model interpretability experience.