Mechanistic Interpretability Hub
Maintained by MATS.
Learning Resources
Mechanistic Interpretability Quickstart Guide - Neel Nanda
This was written as a guide for Apart Research's Mechanistic Interpretability Hackathon, as a compressed version of my getting started post. The spirit is "how to speedrun your way to maybe doing something useful in mechanistic interpretability in a weekend", but hopefully this is useful even to people who aren't aiming for weekend-long projects!
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers - Neel Nanda
This is an extremely opinionated list of my favourite mechanistic interpretability papers, annotated with my key takeaways and what I like about each paper. It flags which bits to deeply engage with, which to skim (and what to focus on when skimming), and which bits I don't care about and recommend skipping, along with fun digressions and various hot takes.
A Comprehensive Mechanistic Interpretability Explainer & Glossary - Neel Nanda
Written by Neel Nanda to accompany the TransformerLens library for mechanistic interpretability of language models. I love hearing when people find my work useful, so please reach out! If you want to get involved in mechanistic interpretability research, check out my guide to getting started.
Neel Nanda’s MATS Training Program 2023-2024
This was written as an outline for Neel Nanda’s MATS Training program over the summer and winter 2023-2024 cohorts. It outlines a five-week training program with problems, experiments, readings, and projects.
200 Concrete Open Problems in Mechanistic Interpretability - Neel Nanda
A list of 200 concrete open problems in mechanistic interpretability, collected and written by Neel Nanda at the end of 2022.
TransformerLens - Neel Nanda
TransformerLens lets you load an open-source language model, like GPT-2, and exposes the model's internal activations to you. You can cache any internal activation and add functions to edit, remove, or replace these activations as the model runs. The core design principle I've followed is to enable exploratory analysis.
Working on Remote Machines - Lucia Quirke
Notes on Compute Resources for Mechanistic Interpretability
This was written for MATS scholars in Neel Nanda's stream during the Winter 2023-2024 cohort who are doing mechanistic interpretability work. It collects notes on different compute resource providers and alumni experiences using them.
Training programs
ML Alignment & Theory Scholars Program (MATS)
Applications for Neel Nanda’s stream close November 10, 2023 11:59 PM (PST); other streams close November 17, 2023 11:59 PM (PST).
Alignment Research Engineer Accelerator (ARENA)
Applications open soon.
Interpretability Hackathon - Apart Research
Status unknown.
Grant funders for independent research
The following funders might be receptive to grant requests for mechanistic interpretability research:
Currently closed.