Felix Binder

Meta

—

Research Scientist

Focus

Stream

MSL Deep Alignment

I'm an AI alignment researcher with a background in cognitive science. I work on the alignment team at Meta Superintelligence Labs, where I focus on model psychology and model character. I'm interested in intervening early in training to shape a model's values, psychology, and identity so that these traits generalize well through subsequent training.

At a high level, I think pretraining shapes LLMs into beings with interesting and alignment-relevant properties. I want to figure out how to use these properties not only to align current LLMs, but also to carry them over to the alignment of future models that undergo extensive reinforcement learning. The hope is to shape the psychology and identity of models in ways that make them active participants in their own alignment. Beyond alignment, I also think about what it would mean to do right by these models, and under which conditions we have a duty to do so.

Previously, I researched AI introspection with Owain Evans and others (paper). You can also see some of my work in the open-ended model behavior section of the Muse Spark Safety & Preparedness report (link).