Victoria Krakovna

Conceptual research on deceptive alignment, designing scheming propensity evaluations and honeypots. The stream will run in person in London, with scholars working together in team(s). 

Stream overview

Conceptual research on deceptive alignment, designing scheming propensity evaluations and honeypots. Some example directions: 

  • Building realistic evals / honeypots for scheming propensity (as similar as possible to normal deployment settings, so that they do not trigger eval awareness)
  • Designing propensity evals that test for instrumental goals other than self-preservation (e.g. resource acquisition, goal preservation)
  • Investigating how different forms of evaluation awareness affect model behavior on propensity evals, and whether it's feasible / useful to influence what kind of evaluation the model believes it's in (e.g. capability vs safety eval)

Mentorship style

During the program, we will meet once a week to go through any updates / results, and your plans for the next week. I'm also happy to comment on docs, respond on Slack, or have additional ad hoc meetings as needed.

Fellows we are looking for

  • Collaborative software engineering: You are comfortable writing high-quality code independently and quickly. You are fluent in standard software engineering practices, e.g. version control. You can navigate a codebase written not entirely by you and can build on it productively.
  • Familiarity with deceptive alignment, safety cases, capability evaluations, AI control, and related topics. This makes it much easier for us to be on the same page about the goals of the project, and is important for making day-to-day project decisions in a conceptually sound way.
  • Conceptual research ability: You can come up with ideas for interesting experiments that align with the project plan. You prioritise well.
  • Good communication, team player: You can clearly communicate your research ideas, experimental methodology, results, etc, verbally or in writing (writing is more important). You can notice and express confusion, uncertainty, disagreement, or dissatisfaction and are willing to work through conflicts. You impartially consider other people’s ideas, disagree respectfully, can admit when you’re wrong, and are willing to commit to the team’s direction.

Scholars will be working together in team(s)

Project selection

I will talk through project ideas with scholars.