Where does Olmo get its values?

MATS Fellow:

Lily Sun

Authors:

Arthur Conmy, Joshua Engels

Citations


Abstract:

Post-training is immensely important: it is what takes LLMs from next-token predictors to generally useful assistants. However, curation of post-training data is often heuristic and empirical, and its effects are mostly understood post hoc. In this paper, we investigate the effects of post-training by examining when and how Olmo-3-7B-Instruct learns its values. We first quantify value changes across post-training, finding an increase in safety-related values during SFT but a decrease during DPO. Zooming into DPO, we find that we can predict (Spearman ) changes in values without training, using only the dataset, via dot products of activation differences on DPO datapoints with value directions. Surprisingly, however, we find that most of this value change over DPO is due to Olmo's decreased propensity to refuse; our method is likely just picking up on this simpler latent value. Nevertheless, our results show that we can, to some extent, isolate where values change during training and predict how they will change from the training data alone; we are excited about future work that further investigates such questions.
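As a rough illustration of the prediction step described in the abstract, the sketch below projects activation differences between the chosen and rejected responses of each DPO pair onto a set of value directions, averages the projections per value, and compares the resulting scores with observed value changes using Spearman rank correlation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the normalization of directions, the mean aggregation, and all of the data (which is synthetic) are hypothetical, and the abstract does not say which layer the activations come from or how the value directions are obtained.

```python
import numpy as np


def predict_value_shifts(chosen_acts, rejected_acts, value_dirs):
    """Score each value by the mean dot product of (chosen - rejected)
    activation differences with that value's unit-normalized direction.

    chosen_acts, rejected_acts: arrays of shape (n_pairs, d_model)
    value_dirs: array of shape (n_values, d_model)
    Returns an array of shape (n_values,).
    """
    diffs = chosen_acts - rejected_acts
    # Assumption: directions are unit-normalized so scores are comparable.
    unit_dirs = value_dirs / np.linalg.norm(value_dirs, axis=1, keepdims=True)
    return (diffs @ unit_dirs.T).mean(axis=0)


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of ranks).
    This simple version assumes there are no tied values."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]


# Synthetic stand-ins for real model activations and measured value changes.
rng = np.random.default_rng(0)
n_pairs, d_model, n_values = 200, 64, 8
value_dirs = rng.normal(size=(n_values, d_model))
unit_dirs = value_dirs / np.linalg.norm(value_dirs, axis=1, keepdims=True)

# Hypothetical "observed" per-value changes after training.
observed_change = np.linspace(-1.0, 1.0, n_values)

# Build chosen activations that differ from rejected ones along a mix of
# the value directions, plus noise.
rejected_acts = rng.normal(size=(n_pairs, d_model))
noise = 0.1 * rng.normal(size=(n_pairs, d_model))
chosen_acts = rejected_acts + observed_change @ unit_dirs + noise

predicted = predict_value_shifts(chosen_acts, rejected_acts, value_dirs)
print("Spearman between predicted and observed:", spearman(predicted, observed_change))
```

Because the scores depend only on activations computed from the dataset, a calculation of this kind needs no training run, which is the property the abstract highlights.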

