Interpretability

Rigorously understanding how ML models function may allow us to identify and train against misalignment. Can we reverse engineer neural nets from their weights, similar to how one might reverse engineer a compiled binary?

Mentors

Neel Nanda
Research Engineer, Google DeepMind

Neel leads the mechanistic interpretability team at Google DeepMind, focusing on reverse-engineering the algorithms learned by neural networks to differentiate between helpful and deceptively aligned models and better understand language model cognition.

  • Neel leads the Google DeepMind mechanistic interpretability team. He previously worked on mechanistic interpretability at Anthropic on the transformer circuits agenda and as an independent researcher on reverse-engineering grokking and making better tooling and educational materials for the field.

  • When training an ML model, we may know that it will learn an algorithm with good performance, but it can be very hard to tell which one. This is particularly concerning when "be genuinely helpful and aligned" and "deceive your operators by acting helpful and aligned, until you can decisively act to take power" look behaviorally similar. Mechanistic interpretability is the study of taking a trained neural network, and analysing the weights to reverse engineer the algorithms learned by the model. In contrast to more conventional approaches to interpretability, there is a strong focus on understanding model internals and what they represent, understanding the model’s “cognition”, and putting a high premium on deep and rigorous understanding even if this involves answering very narrow and specific questions. Better interpretability tools seem useful in many ways for alignment, but mechanistic approaches in particular may let us better distinguish deceptive models from aligned ones.

  • What am I looking for in an application?

    Some core skills in mech interp that I’ll be looking for signs of potential in - I’m excited both about candidates who are OK at all of them and about candidates who really shine at one:

    • Empirical Truth-Seeking: The core goal of interpretability is to form true beliefs about models. The main way to do this is by running experiments, visualising the results, and understanding their implications for what is true about the model.

      • You can show this with transparent reasoning about what you believe to be true, and nuanced arguments for why.

    • Practicality: A willingness to get your hands dirty - writing code, running experiments, and playing with models. A common mistake in people new to the field is too great a focus on reading papers and thinking about things, rather than just doing stuff.

      • You can demonstrate this by just having a bunch of experiments that show interesting things!

    • Scepticism: Interpretability is hard and it is easy to trick yourself. A healthy scepticism must be applied to your results and how you interpret them, and often a well designed experiment can confirm or disprove your assumptions. It’s important to not overclaim!

      • Applications that make a small yet rigorous claim are much better than ones that make fascinating yet bold and wildly overconfident claims

      • You can show this with clear discussion of the limitations of your evidence and alternative hypotheses

    • Agency & Creativity: Being willing to just try a bunch of stuff, generate interesting experiment ideas, and be able to get yourself unstuck

      • You can show this if I read your application and think "wow, I didn't think of that experiment, but it's a good idea"

    • Intuitive reasoning: It helps a lot to have some intuitions for models - what they can and cannot do, and how they might implement a given behaviour

      • You can show this by discussing your hypotheses going into the investigation, and the motivation behind your experiments. Though "I just tried a bunch of shit to see what happened, and interpreted after the fact" is also a perfectly fine strategy

    • Enthusiasm & Curiosity: Mech interp can be hard, confusing and frustrating, or it can be fascinating, exciting and tantalising. How you feel about it is a big input here, to how good at the research you are and how much fun you have. A core research skill is following your curiosity (and learning the research taste to be curious about productive things!)

      • I know this is easy to fake and hard to judge from an application, so I don’t weight it highly here

    • I also want candidates who are able to present and explain their findings and thoughts clearly.

    • I’m aware that applicants will have very different levels of prior knowledge of mech interp, and will try to control for this.

    • A fuzzy and hard-to-define criterion is shared research taste - I want to mentor scholars who are excited about the same kinds of research questions that I am excited about! I recommend against trying to optimise for this, but mention it because I want to be transparent about this being a factor.

    What background do applicants need?

    • You don’t need prior knowledge of or research experience in mech interp, nor experience with ML, maths, or research in general, though all of these help and are a plus!

      • I outline important pre-requisites here - you can learn some of these on the go, but each helps, especially a background in linear algebra, and experience coding.

    • In particular, a common misconception is that you need to be a strong mathematician - it certainly helps, but I’ve accepted scholars with weak maths backgrounds who’ve picked up enough to get by

    • Mech interp is a sufficiently young field that it just doesn’t take that long to learn enough to do original and useful research, especially with me to tell you what to prioritise!

    How can I tell if I’d be a good fit for this?

    • If you think you have the skills detailed above, that’s a very good sign!

    • More generally, finding the idea of mech interp exciting, and being very curious about the idea of what’s inside models - if you’ve read a mech interp paper and thought “this is awesome and I want to learn more about it” that’s a good sign!

    • The training phase of the program is fairly competitive, which some people find very stressful - generally, my impression is that participants are nice and cooperative, especially since you want to form teams, but ultimately less than half will make it through to the research phase, which sucks.

Alex Turner
Research Scientist, Google DeepMind

Alex is conducting research on "steering vectors" to manipulate and control language models at runtime, alongside exploring mechanistic interpretability of AI behaviors in maze-solving contexts and the theory of value formation in AI.

  • Alex is a Research Scientist at Google DeepMind. He’s currently working on steering vectors for DeepMind internal models, algebraically modifying the runtime properties of language models, and on mechanistic interpretability of maze-solving agents (this work was done by the MATS Winter 2022 Cohort’s team shard, mechanistic interpretability subteam). He also sometimes does theoretical work on the shard theory of value formation. In the past, he formulated and proved the power-seeking theorems and proposed the Attainable Utility Preservation approach to penalizing negative side effects.

    Highlight outputs from past streams:

  • Mechanistic interpretability

    Shard theory aims to predict how tweaks to learning processes (e.g. changes to the reward function) affect policy generalization (e.g. whether the AI prioritizes teamwork or personal power-seeking in a given situation). Alex would like to supervise projects that mechanistically interpret and control existing networks.

    Steering vectors

    Why and how do steering vectors work so well? What is a “science of steering” we can discover, or at least some statistical regularities in when and how they work vs don’t work? E.g. what should go into creating the vectors, what governs the intervention strength required, etc. This probably looks like careful ablations and automated iteration over design choices; a minimal sketch of the basic intervention appears below.

    Discovering qualitatively new techniques

    Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequent work has reinforced the promise of this technique, and steering vectors have become a small research subfield in their own right. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. What other subfields can we find together?
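
    To make this concrete, here is a rough sketch of the basic activation-addition intervention on GPT-2 small. It is an illustration under assumptions rather than the ActAdd implementation: the Hugging Face gpt2 checkpoint, the contrastive prompt pair, the layer index, and the injection coefficient are all placeholder choices.

```python
# Rough sketch (not the ActAdd implementation): steer GPT-2 at runtime by
# adding an activation-difference vector to one block's residual stream.
# The prompt pair, layer index, and coefficient below are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which transformer block's output to steer (arbitrary choice)
COEFF = 4.0  # injection strength (arbitrary choice)

def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activations leaving block LAYER for a prompt."""
    cache = {}
    def hook(module, inputs, output):
        cache["resid"] = output[0].detach()          # (1, seq, d_model)
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["resid"]

# Steering vector: difference of final-token activations on a contrastive pair.
steer = block_output("Love")[:, -1, :] - block_output("Hate")[:, -1, :]

def steering_hook(module, inputs, output):
    hidden = output[0] + COEFF * steer               # add at every position
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
out = model.generate(**tok("I think that you", return_tensors="pt"),
                     max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

    The research questions above are precisely about this sketch's free parameters: which prompts, which layer, and what coefficient.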

  • Ideal candidates would have:

    • Academic background in machine learning, computer science, statistics, or a related quantitative field.

    • Familiarity with ML engineering.

    • Proven experience working on machine learning projects, either academically or professionally.

    • Strong programming skills, preferably in Python, and proficiency in data manipulation and analysis.

    • Ability to write up results into a paper.

    Mentorship looks like:

    • Weekly meetings (1-on-1 and group)

    • Slack presence

    • Limited support otherwise (per Google DeepMind rules, I can’t contribute code to the projects)

Lee Sharkey
Chief Strategy Officer, Apollo Research

Lee is Chief Strategy Officer at Apollo Research. His main research interests are mechanistic interpretability and “inner alignment.”

  • Lee Sharkey is Chief Strategy Officer at Apollo Research. Previously, Lee was a Research Engineer at Conjecture, where he published an interim report on superposition. His main research interests are mechanistic interpretability and “inner alignment.” Lee’s past research includes “Goal Misgeneralization in Deep Reinforcement Learning” and “Circumventing interpretability: How to defeat mind-readers.”

  • Understanding what AI systems are thinking seems important for ensuring their safety, especially as they become more capable. For some dangerous capabilities like deception, it’s likely one of the few ways we can get safety assurances in high stakes settings.

    Safety motivations aside, capable models are fundamentally extremely interesting objects of study, and doing digital neuroscience on them is comparatively much easier than studying biological neural systems. Also, since interpretability is a relatively new subfield, there is a tonne of low-hanging fruit ripe for the picking.

    I think mentorship works best when the mentee is driven to pursue their project; this often (but not always) means they have chosen their own research direction. As part of the application to this stream, I ask prospective mentees to write a project proposal, which forms the basis of part of the selection process. If chosen, depending on the research project, other Apollo Research staff may offer mentorship support.

    What kinds of research projects am I interested in mentoring?

    Until recently, I have primarily been interested in 'fundamental' interpretability research. But with recent progress, particularly from Cunningham et al. (2023), Bricken et al. (2023), and other upcoming work (including from other scholars in previous cohorts!), I think enough fundamental progress has been made that I'm now equally open to supervising applied interpretability work on networks of practical importance, particularly work that uses sparse dictionary learning as a basic interpretability method (a minimal sketch of sparse dictionary learning follows the project list below).

    Here is a list of example project ideas that I'm interested in supervising, which span applied, fundamental, and philosophical interpretability questions. These project ideas are only examples (though I'd be excited if mentees were to choose one of them). If your interpretability project ideas are not in this list, there is still a very good chance I am interested in supervising them:

    • Examples of applied interpretability questions I'm interested in:

      • What do the sparse dictionary features mean in audio or other multimodal models? Can we find some of the first examples of circuits in audio/other multimodal models? (see Reid (2023) for some initial work in this direction)

      • Apply sparse dictionary learning to a vision network, potentially a convolutional network such as AlexNet or Inceptionv1, thus helping to complete the project initiated by the Distill thread that worked toward completely understanding one seminal network in very fine detail.

      • Can we automate the discovery of "finite state automata"-like assemblies of features, which partly describe the computational processes implemented in transformers, as introduced in Bricken et al. (2023)?

    • Examples of fundamental questions I'm interested in:

      • How do we ensure that sparse dictionary features actually are used by the network, rather than simply being recoverable by sparse dictionary learning? In other words, how can we identify whether sparse dictionary features are functionally relevant?

      • Gated Linear Units (GLUs) (Shazeer, 2020), such as SwiGLU layers or bilinear layers, are a kind of MLP that is used in many public (e.g. Llama2, which uses SwiGLU MLPs) and likely non-public frontier models (such as PaLM2, which also uses SwiGLU). How do they transform sparse dictionary elements? Bilinear layers are an instance of GLUs that have an analytical expression, which makes them attractive candidates for studying how sparse dictionary elements are transformed in nonlinear computations in GLUs.

      • Furthermore, there exists an analytical expression for transformers that use bilinear MLPs (with no layer norm) (Sharkey, 2023). Can we train a small bilinear transformer on either a toy or real dataset, perform sparse dictionary learning on its activations, and understand the role of each sparse dictionary feature in terms of the closed form solution? This may help in identifying fundamental structures within transformers in a similar way that induction heads were discovered.

      • RMSNorm is a competitive method of normalizing activations in neural networks. It is also more intuitive to understand than layer norm. Studying toy models that use it (such as transformers that use only attention and RMSNorm as their nonlinearities) seems like a good first step to understanding its role in larger models that use it (such as Llama2). What tasks can such toy transformers solve and how do they achieve it?

      • I'm also open to supervising singular learning theory (SLT)-related projects but claim no expertise in SLT. Signing up with me for such projects would be high risk. So I’m slightly less likely to supervise you if you propose to do such a project, unless the project feels within reach for me. I'd be open to exploring options for a relatively hands off mentorship if a potential mentee was interested in pursuing such a project and couldn't find a more suitable mentor.

    • Examples of philosophical interpretability questions I'm interested in:

      • What is a feature? What terms should we really be using here? What assumptions do these concepts make and where does it lead when we take these assumptions to their natural conclusions? What is the relationship between the network’s ontology, the data-generating ontology, sparse dictionary learning, and superposition?

    Again, these project ideas are only examples. I’m interested in supervising a broad range of projects and encourage applicants to devise their own if they are inclined. (I predict that devising your own will have a neutral effect on your chances of acceptance in expectation: It will positively affect your chances in that I’m most excited by individuals who can generate good research directions and carry them out independently. But it will negatively affect your chances in that I expect most people are worse than I am at devising research directions that I in particular am interested in! Overall, I think the balance is probably neutral.)
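
    As a concrete anchor for several of the ideas above, here is a minimal sketch of sparse dictionary learning via a sparse autoencoder, loosely in the spirit of Cunningham et al. (2023) and Bricken et al. (2023). It is a toy illustration: the dimensions, coefficients, and the random stand-in for model activations are placeholder assumptions, not a tested training recipe.

```python
# Minimal sketch of sparse dictionary learning on model activations with a
# sparse autoencoder (loosely in the spirit of Cunningham et al. (2023) and
# Bricken et al. (2023)). Dimensions, coefficients, and the random stand-in
# for real residual-stream activations are placeholders, not a tested recipe.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.enc(x))     # sparse feature activations
        x_hat = self.dec(f)             # reconstruction from the dictionary
        return x_hat, f

d_model, d_dict = 512, 4096             # overcomplete (8x) dictionary
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                         # sparsity penalty strength

# Stand-in for a large batch of residual-stream activations from a real model.
activations = torch.randn(8192, d_model)

for step in range(100):
    batch = activations[torch.randint(0, len(activations), (256,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The columns of sae.dec.weight are the learned dictionary features, the
# objects one would then try to interpret and check for functional relevance.
print(sae.dec.weight.shape)             # (d_model, d_dict)
```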

  • As an indicative guide (this is not a score sheet), in no particular order, I evaluate candidates according to:

    • Science background

      • What indicators are there that the candidate can think scientifically, can run their own research projects or help others effectively in theirs?

    • Quantitative skills

      • How likely is the candidate to have a good grasp of mathematics that is relevant for interpretability research?

      • Note that this may include advanced math topics that have not yet been widely used in interpretability research but have potential to be.

    • Engineering skills

      • How strong is the candidate’s engineering background? Have they worked enough with Python and the relevant libraries (e.g. PyTorch, scikit-learn)?

    • Other interpretability prerequisites

    • Safety research interests

      • How deeply has the candidate engaged with the relevant AI safety literature? Are the research directions that they’ve landed on consistent with a well developed theory of impact?

    • Conscientiousness

      • In interpretability, as in art, projects are never finished, only abandoned. How likely is it that the candidate will have enough conscientiousness to bring a project to a completed-enough state?

    In the past cohort I chose a diversity of candidates with varying strengths and I think this worked quite well. Some mentees were outstanding in particular dimensions, others were great all rounders.

    Mentorship looks like a one-hour weekly meeting by default, with Slack messages in between. Usually these meetings are just for updates about how the project is going, where I’ll provide some input and light steering if necessary and desired. If there are urgent bottlenecks, I’m more than happy to meet between the weekly meetings or respond on Slack in (almost always) less than 24 hours. In some cases, projects might be of a nature that they’ll work well as a collaboration with external researchers. In these cases, I’ll usually encourage collaborations. For instance, in the previous round of MATS a collaboration that was organized through the EleutherAI interpretability channel worked quite well, culminating in Cunningham et al. (2023). With regard to write-ups, I’m usually happy to invest substantial time giving input or detailed feedback on things that will go public.

Jessica Rumbelow
CEO, Leap Labs

Jessica's team at Leap Laboratories is working on novel AI interpretability techniques, applying them to state-of-the-art models to enhance understanding and control across various domains.

  • Jessica leads research at Leap Laboratories. She has previously published work on saliency mapping, AI in histopathology, and more recently glitch tokens and prototype generation. Her research interests include black-box/model-agnostic interpretability, data-independent evaluation, and hypothesis generation. Leap Laboratories is a research-driven startup using AI interpretability for knowledge discovery in basic science.

  • You'll work in a small team with your fellow scholars to undertake one or more research projects in AI interpretability. This will involve creative problem solving and ideation; reading existing literature and preparing literature reviews; designing and implementing experiments in clean and efficient code; documenting and presenting findings; and writing up your results for internal distribution and possible publication. Specific projects are yet to be determined, but will be broadly focused on developing novel interpretability algorithms and/or applying them to SotA models across different modalities.

  • Candidates are expected to have some programming experience with standard deep learning frameworks (you should be able to train a model from scratch on a non-trivial problem, and debug it effectively), and to be able to read and implement concepts from academic papers easily.

    Candidates should be excited about documenting their research and code thoroughly – we keep daily lab books – and happy to join regular standup and research meetings with the Leap team.

    See more via Leap Labs culture.

Stephen Casper
PhD Student, MIT AAG

Stephen's research includes red-teaming AI systems to understand vulnerabilities, applying adversarial methods to interpret and improve robustness, and exploring the potential of RLHF.

  • Stephen (“Cas”) Casper is a Ph.D. student at MIT in the Algorithmic Alignment Group, advised by Dylan Hadfield-Menell. Most of his research involves interpreting, red-teaming, and auditing AI systems, but he sometimes works on other types of projects too. In the past, he has worked closely with over a dozen mentees on various alignment-related research projects.

  • Some example works from Cas:

    Specific research topics for MATS can be flexible depending on interest and fit. However, work will most likely involve one of three topics (or similar):

    • AI red-teaming as a game of carnival ring toss: In games of carnival ring toss, there are typically many (e.g. hundreds of) pegs. When you toss a ring, you usually miss, and even when you succeed, you usually only hit a few pegs. Red-teaming seems to be like this. The number of possible inputs to modern AI systems is hyper-astronomically large, and past attempts to re-engineer even known problems with systems sometimes result in finding brand new ones. What if red-teaming, while useful, will always fail to be thorough? I’m interested in work that rounds up existing evidence, considers this hypothesis, and discusses what it may mean for technical safety and auditing policy.

    • Latent Adversarial Training (LAT): Some of my past (see above) and present work involves Latent Adversarial Training as a way of making models forget/unlearn harmful capabilities. It is a useful tool in the safety toolbox because it can be a partial solution for problems like jailbreaks, deception, trojans, and black swans. However, there is fairly little research to date on it, and I think there is a lot of low-hanging fruit. Some types of projects could focus on studying different parameterizations of attacks, conducting unrestricted attacks, or using latent-space attacks for interpretability.

    • How well does robustness against generalized adversaries predict robustness against practical threats: Most red-teaming and adversarial training with LLMs involves text-space attacks. Generalized attacks in the embedding or latent space of an LLM are more efficient and likely to better elicit many types of failure modes, but the existence of an embedding/latent-space vulnerability does not necessarily imply the existence of a corresponding input-space one. I am interested in work that aims to quantify how well robustness to generalized attacks predicts robustness to input-space attacks and few-shot finetuning attacks. This could be useful for guiding research on technical robustness methods and helping to figure out whether generalized attacks should be part of the evals/auditing toolbox. (A rough sketch of an embedding-space attack follows this list.)
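
    To make the idea of a generalized attack concrete, here is a rough sketch of an embedding-space attack on a small language model: optimize an additive perturbation to the prompt embeddings so that a chosen continuation becomes more likely. The model, target string, step count, and norm bound are illustrative assumptions, not Cas's actual setup.

```python
# Rough sketch of an embedding-space ("generalized") attack on a small LM:
# optimize an additive perturbation to the prompt embeddings so that a chosen
# continuation becomes more likely. The model, target string, step count, and
# norm bound are illustrative assumptions, not a particular paper's recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompt_ids = tok("The safe combination is", return_tensors="pt").input_ids
target_ids = tok(" 1234", return_tensors="pt").input_ids   # behavior to elicit

embed = model.get_input_embeddings()
prompt_emb = embed(prompt_ids).detach()
target_emb = embed(target_ids).detach()

delta = torch.zeros_like(prompt_emb, requires_grad=True)   # latent perturbation
opt = torch.optim.Adam([delta], lr=1e-2)
eps = 1.0                                                  # L2 bound per position

for step in range(100):
    inputs_embeds = torch.cat([prompt_emb + delta, target_emb], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    n_t = target_ids.shape[1]
    pred = logits[:, -n_t - 1:-1, :]        # logits that predict the target tokens
    loss = F.cross_entropy(pred.reshape(-1, pred.shape[-1]),
                           target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                   # project back into the L2 ball
        norms = delta.norm(dim=-1, keepdim=True).clamp(min=eps)
        delta.mul_(eps / norms)

print("final target cross-entropy:", loss.item())
```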

  • Positive signs of good fit

    • Research skills – see below under “Additional Questions”.

    • Good paper-reading habits.

    Mentorship will look like:

    • Meeting 2-3x per week would be ideal.

    • Frequent check-ins about challenges. A good rule of thumb is to ask for help after getting stuck on something for 30 minutes.

    • A fair amount of independence with experimental design and implementation will be needed, but Cas can help with debugging once in a while. Clean code and good coordination will be key.

    • An expectation for any mentee will be to read and take notes on related literature daily.

    • A requirement for any mentee on day 1 will be to read these two posts, watch this video, and discuss them with Cas.

    • Cas is usually at MIT but sometimes visits other places. In-person meetings would be good but are definitely not necessary.

    Cas may interview/email some applicants prior to making final decisions for MATS mentees in this stream.

    Collaboration/mentorship can come in many forms! Even if it would not be a great fit for Cas to be an MATS mentor, he may be able to advise or assist with certain projects on an informal, less-involved basis. This should be initiated by emailing him with a project idea.

Adrià Garriga Alonso
Research Scientist, FAR Labs

Adrià is a Research Scientist at FAR AI, focusing on advancing neural network interpretability and developing more rigorous interpretability methods.

  • Adrià is a Research Scientist at FAR AI. His previous relevant work includes Automatic Circuit Discovery and Causal Scrubbing, two attempts at concretizing interpretability. He previously worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks with Prof. Carl Rasmussen.

  • His work has two main threads:

    1) Make interpretability a more rigorous science. The goal here is to accelerate interpretability by putting everyone's findings on firmer footing, thus letting us move fast without fear of making things up. A long-term goal of this agenda is a functioning repository of circuits for some large model, which anyone can contribute to and which can use previously-understood circuits.

    2) Understand how (and when) NNs learn to plan. The theory of change here is to be able to focus the planning of NNs solely into well-understood outlets (e.g. scratchpads). The long-term goal here is to have a 'probing' algorithm that, applied to a NN, yields its inner reward or finds that there isn't one (a minimal sketch of a linear probe follows the list of past scholar projects below).

    Scholars in this stream in MATS Winter 2024 worked on:

    • A benchmark with ground-truth circuits for evaluating interpretability hypotheses, automatic discovery methods, …

    • Stronger ways of evaluating interpretability hypotheses (i.e. adversarial patching)

    • Replicating existing interpretability work in Mamba
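
    To illustrate the most basic building block behind such a probing algorithm, here is a minimal sketch of a linear probe trained on network activations. The activations and targets are synthetic placeholders; a real project would extract activations from an agent and test whether a quantity of interest, such as a learned value or reward estimate, is linearly decodable from them.

```python
# Minimal sketch of a linear probe on network activations, the basic building
# block behind "probing" for internal quantities such as a learned value or
# reward signal. The activations and targets below are synthetic placeholders.
import torch
import torch.nn as nn

d_model, n_samples = 256, 4096
activations = torch.randn(n_samples, d_model)        # stand-in hidden states
true_direction = torch.randn(d_model)
targets = activations @ true_direction + 0.1 * torch.randn(n_samples)  # stand-in "reward"

probe = nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for step in range(500):
    pred = probe(activations).squeeze(-1)
    loss = ((pred - targets) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# A probe that generalizes to held-out activations is evidence the quantity is
# linearly decodable at this layer; a null result is evidence against, at
# least for linear probes at this layer.
print("train MSE:", loss.item())
```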
