MATS Fellow:
Lily Sun
Authors:
Arthur Conmy, Joshua Engels
Abstract:
Post-training is immensely important: it is what takes LLMs from next-token predictors to generally useful assistants. However, curation of post-training data is often heuristic and empirical, and its effects are mostly understood post hoc. In this paper, we investigate the effects of post-training by examining when and how Olmo-3-7B-Instruct learns its values. We first quantify value changes across post-training, finding an increase in safety-related values during SFT but a decrease during DPO. Zooming into DPO, we find that we can predict (Spearman ) changes in values without training, using only the dataset, via dot products of activation differences on DPO datapoints with value directions. Surprisingly, however, we find that most of this value change over DPO is due to Olmo's decreased propensity to refuse; our method is likely just picking up on this simpler latent value. Nevertheless, our results show that we can, to some extent, isolate where values change during training and predict how they will change from the training data alone; we are excited about future work that further investigates such questions.
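As a rough illustration of the dot-product idea in the abstract, here is a minimal sketch, not the authors' implementation. It assumes that activations for the chosen and rejected response of each DPO pair have already been extracted at some layer, and that one direction vector per value is available. The function names, the averaging over pairs, and the unit-normalization of directions are all assumptions made for this example.

```python
import numpy as np


def predicted_value_shifts(chosen_acts, rejected_acts, value_directions):
    """Score each value by projecting the mean chosen-minus-rejected
    activation difference onto that value's direction.

    chosen_acts, rejected_acts: arrays of shape (n_pairs, d_model)
    value_directions: array of shape (n_values, d_model)
    All inputs are hypothetical; how they are extracted is not specified here.
    """
    chosen_acts = np.asarray(chosen_acts, dtype=float)
    rejected_acts = np.asarray(rejected_acts, dtype=float)
    value_directions = np.asarray(value_directions, dtype=float)

    # Activation difference per DPO pair, averaged over the dataset (assumption).
    mean_diff = (chosen_acts - rejected_acts).mean(axis=0)

    # Unit-normalize each direction so scores are comparable across values (assumption).
    norms = np.linalg.norm(value_directions, axis=1, keepdims=True)
    unit_directions = value_directions / norms

    # One dot product per value direction.
    return unit_directions @ mean_diff


def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of ranks.

    Assumes there are no ties in x or y.
    """
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

Under these assumptions, the predicted scores would be compared against measured per-value changes from before to after DPO, for example with spearman(predicted, observed), to check how well the dataset alone ranks the value shifts.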
Building Comparative Motivation Profiles with Instrumental Interventions
Authors:
David Vella Zarb, Rustem Turtayev, Taywon Min
Date:
June 6, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.