Roger Grosse

Roger Grosse’s stream investigates how to improve influence functions and other training data attribution methods, and uses these tools to study alignment-related phenomena such as out-of-context reasoning and emergent misalignment. The ideal scholar has experience with LLM internals and strong statistics/applied math skills (especially numerical linear algebra), and can independently drive research from literature review through experimentation and analysis. Roger provides shovel-ready projects while giving exceptional scholars freedom to pursue their own ideas, and is open to scholars collaborating with others.

Stream overview

This stream explores ways to improve influence functions and other training data attribution methods, and ways to use training data attribution to understand alignment-related phenomena such as out-of-context reasoning or emergent misalignment.

Mentors

Roger Grosse
Anthropic, Associate Professor
Toronto
Interpretability

Mentorship style

I will meet with scholars for 1 hour per week by default, and will be available to answer questions on Slack roughly daily.

Fellows we are looking for

  • Experience working with LLM internals
  • Strong background in statistics and/or applied math (especially numerical linear algebra)
  • Ability to carry out research independently on the timescale of weeks (reading the literature, formulating and carrying out experiments, interpreting results)
  • Ability and willingness to dig into details to get at the root causes of phenomena

Scholars are welcome to work with collaborators if they would find it valuable.

Project selection

I will give the scholar the level of freedom they are ready for. I will be prepared with focused, shovel-ready projects, but exceptional scholars with a vision they are excited about will have the flexibility to pursue it.