Steering Language Models With Activation Engineering

MATS Fellow:

Lisa Thiergart, David Udell, Ulisse Mini

Authors:

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Citations

384 Citations

Abstract:

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as"Love"versus"Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the"Love"-"Hate"steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.

Recent research

SL5 Standard for AI Security

Authors:

Yoav Tzfati

Date:

March 10, 2026

Citations:

0

Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains

Authors:

Roy Rinberg

Date:

March 5, 2026

Citations:

0

Frequently asked questions

What is the MATS Program?
How long does the program last?