Alex Turner

Independent

Research scientist

Focus

Interpretability, Agent Foundations

Stream

Team Shard

Alex is currently working on training invariants into model behavior. In the past, he formulated and proved the power-seeking theorems, co-formulated the shard theory of human value formation, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.

Highlighted outputs from past streams:

- Mechanistic interpretability to understand and control maze-solving agents (MATS 3.0, paper)
  - Introduced the now-staple technique of “steering vectors”
- Steering GPT-2-XL by adding an activation vector
- Steering Llama-2 with contrastive activation additions (MATS 4.0, paper)
- Unsupervised discovery of model behaviors using steering vectors (MATS 5.0)
- Gradient routing (MATS 6.0)
- Unlearn and distill for making robust unlearning a reality