Mechanistic Interpretability Hub
Maintained by MATS.
Learning Resources
Mechanistic Interpretability Quickstart Guide - Neel Nanda
This was written as a guide for Apart Research's Mechanistic Interpretability Hackathon, as a compressed version of my getting started post. The spirit is "how to speedrun your way to maybe doing something useful in mechanistic interpretability in a weekend", but hopefully this is useful even to people who aren't aiming for weekend-long projects!
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers - Neel Nanda
This is an extremely opinionated list of my favourite mechanistic interpretability papers, annotated with my key takeaways and what I like about each paper. It flags which bits to deeply engage with, which to skim (and what to focus on when skimming), and which bits I don't care about and recommend skipping, along with fun digressions and various hot takes.
A Comprehensive Mechanistic Interpretability Explainer & Glossary - Neel Nanda
Written by Neel Nanda to accompany the TransformerLens library for mechanistic interpretability of language models. I love hearing when people find my work useful, so please reach out! If you want to get involved in mechanistic interpretability research, check out my guide to getting started.
Neel Nanda’s MATS Training Program 2023-2024
This was written as an outline for Neel Nanda’s MATS Training program over the summer and winter 2023-2024 cohorts. It outlines a five-week training program with problems, experiments, readings, and projects.
200 Concrete Open Problems in Mechanistic Interpretability - Neel Nanda
A list of 200 concrete open problems in mechanistic interpretability, collected and written by Neel Nanda at the end of 2022.
TransformerLens - Neel Nanda
TransformerLens lets you load an open-source language model, like GPT-2, and exposes the model's internal activations to you. You can cache any internal activation and add functions to edit, remove, or replace these activations as the model runs. The core design principle I've followed is to enable exploratory analysis.
Working on Remote Machines - Lucia Quirke
Notes on Compute Resources for Mechanistic Interpretability
This was written for MATS scholars in Neel Nanda's stream during the Winter 2023-2024 cohort who are doing mechanistic interpretability work. It collects notes on different compute resource providers and alumni experiences using them.
Training programs
ML Alignment & Theory Scholars Program (MATS)
Applications for Neel Nanda’s stream close November 10, 2023 11:59 PM (PST); other streams close November 17, 2023 11:59 PM (PST).
Alignment Research Engineer Accelerator (ARENA)
Applications open soon.
Interpretability Hackathon - Apart Research
Status unknown.
Grant funders for independent research
The following funders might be receptive to grant requests for mechanistic interpretability research:
Currently closed.