Lee Sharkey

Lee's stream will focus primarily on improving mechanistic interpretability methods for reverse-engineering neural networks.

Stream overview

Understanding what AI systems are thinking seems important for ensuring their safety, especially as they become more capable. For some dangerous capabilities like deception, it’s likely one of the few ways we can get safety assurances in high-stakes settings.

The ability to reverse engineer what neural networks have learned promises one of the few ways that we might get assurances for safe generalization behaviour, especially as AI systems become vastly more capable than humans.

Safety motivations aside, capable AI systems are extremely interesting objects of study, and doing digital neuroscience on them is comparatively much easier than studying biological neural systems.

A majority of reverse-engineering-focused interpretability work has involved sparse dictionary learning (SDL), which has various issues (Sharkey et al., 2025). In response to these issues, my team and I at Goodfire (and previously Apollo Research) developed a new approach to decomposing neural networks, called parameter decomposition. We have used parameter decomposition to resolve feature splitting, identify attention-head distributed computations, identify circuits, and more.

We believe there are gaps in our method that remain to be filled. In our work (whether SAEs or parameter decomposition), we minimize 'description length'. But we are not yet confident that we have the right 'type of description'. We think that understanding computational manifolds (which are projections of activation manifolds) is likely part of the answer here, since they may offer an even more concise description of neural computation than SDL latents or VPD parameter subcomponents.

MATS projects in my stream should at least be conceptually informed by parameter decomposition, manifolds, and minimum description length framings of interpretability, if not build on them directly.

Mentorship style

Mentorship looks like a 1-hour weekly meeting by default, with approximately daily Slack messages in between. Usually these meetings are just for updates about how the project is going, where I’ll provide some input and steering if necessary and desired. If there are urgent bottlenecks, I’m more than happy to meet between the weekly meetings or respond on Slack in (almost always) less than 24 hours. We'll often run daily standup meetings if timezones permit, but these are optional.

Fellows we are looking for

As an indicative guide (this is not a score sheet), in no particular order, I evaluate candidates according to:

Science background: What indicators are there that the candidate can think scientifically, can run their own research projects or help others effectively in theirs?

Quantitative skills: How likely is the candidate to have a good grasp of mathematics that is relevant for interpretability research? Note that this may include advanced math topics that have not yet been widely used in interpretability research but have the potential to be.

Engineering skills: How strong is the candidate’s engineering background? Have they worked enough with Python and the relevant libraries (e.g. PyTorch, scikit-learn)?

Other interpretability prerequisites: How likely is the candidate to have a good grasp of the content gestured at in Neel Nanda’s list of interpretability prerequisites?

Safety research interests: How deeply has the candidate engaged with the relevant AI safety literature? Are the research directions that they’ve landed on consistent with a well developed theory of impact?

Conscientiousness: In interpretability, as in art, projects are never finished, only abandoned. How likely is it that the candidate will have enough conscientiousness to bring a project to a completed-enough state?

In the past cohort I chose a diversity of candidates with varying strengths, and I think this worked quite well. Some mentees were outstanding in particular dimensions; others were great all-rounders.

In some cases, projects might be of a nature that they’ll work well as a collaboration with external researchers. In these cases, I’ll usually encourage collaborations. But in general I leave the decision to form collaborations up to each scholar.

Project selection

In general, I'd like projects in my stream to be at least conceptually informed by parameter decomposition, manifolds, and minimum description length framings of interpretability, if not to build on them directly.

Scholars and I will discuss projects and come to a consensus on what feels like a good direction. I will not tell scholars to work on a particular direction, since, in my experience, intrinsic motivation to work on a particular direction is important for producing good research.
