Evaluations
Many stories of AI accident and misuse involve potentially dangerous capabilities, such as sophisticated deception and situational awareness, that have not yet been demonstrated in AI. Can we evaluate such capabilities in existing AI systems to form a foundation for policy and further technical work?
Mentors
Evan Hubinger
Research Scientist, Anthropic
Evan is open to a range of projects from empirical to theoretical alignment research, specifically interested in deceptive alignment, predictive model conditioning, and situational awareness in LLMs.
-
Evan Hubinger is a research scientist at Anthropic where he leads the Alignment Stress-Testing team. Before joining Anthropic, Evan was a research fellow at the Machine Intelligence Research Institute. Evan has done both very empirical alignment research, such as “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,” and very theoretical alignment research, such as "Risks from Learned Optimization in Advanced Machine Learning Systems.”
-
I am open to very empirical projects, very theoretical projects, and anything in between. Some particular areas of interest:
Model organisms of misalignment: Anything in the agenda layed out in “Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research.”
Conditioning predictive models: Anything in the agenda layed out in “Conditioning Predictive Models: Risks and Strategies.”
Deceptive alignment: Anything related to deceptive instrumental pseudo-alignment as defined in "Risks from Learned Optimization in Advanced Machine Learning Systems.”
-
I tend to be pretty low-touch, so working with me will require you to be able to execute on projects fairly independently.
Be prepared for me to push you to make sure you have a clear story for how your work actually makes AI more likely to go well.
Marius Hobbhahn
CEO, Apollo Research
Marius is working on quantifying AI evaluations and understanding model goals through behavioral analysis, aiming to refine AI auditing and oversight methods.
-
Marius Hobbhahn is currently building up a new technical AI safety organization, Apollo Research. His organization’s research agenda revolves around deception, interpretability, model evaluations and auditing. Before founding this organization, he was a Research Fellow at Epoch and a MATS Scholar while he pursued his PhD in Machine Learning (currently on pause) at International Max-Planck Research School Tübingen.
-
Here are suggestions for potential research projects supervised by Marius:
Quantifying evaluations: For evals to be maximally useful, we would like to make quantitative statements about their predictions, e.g. “it takes X effort to achieve Y capability”. This project would first investigate different ways of quantifying evals, e.g. FLOP, hours invested, money invested, and more (maybe 2 weeks). Then, the scholar would empirically test the robustness of these ideas, e.g. by identifying empirical scaling laws. The project allows for independence and gives the scholar the opportunity to decide on research directions.
Goal evaluation and identification in NNs: For alignment, it is very important to know whether models are goal-directed and, if yes, what goals they have. A deep understanding of goals might only be attainable through interpretability but behavioral measures might bring us quite far. This project aims to build simple behavioral evaluations of goal-directedness. The early parts of the project are well-scoped. After the initial phase is done, the scholar has more freedom to decide on research directions.
Measuring the quality of eval datasets: We want to understand how we can measure the quality of eval datasets. In the beginning, we would investigate if there are systematic biases between human-written and model-written evals on the Anthropic model-written evals dataset. Apollo has already invested ~2 weeks into this project and there are interesting early findings. This project is very well-scoped and can be done with very little research experience.
If you have a high-quality research proposal, Marius might also be willing to supervise that. It’s also possible and encouraged to team up with other scholars to work on the same project. For more background and context on these research projects reading our post on Science of Evals might be helpful.
-
Mentorship:
Marius offers between 1-3h of calls per week (depending on progress) with asynchronous slack messages.
The goal of the project is to write a blog post or other form of publication.
You might be interested in that project if:
You have some basic research experience and enthusiasm for research.
You have enough software experience that you can write the code for this project. 1000+ hours of Python might be a realistic heuristic.
You want to take ownership of a project. Marius will provide mentorship but you will spend most time on it, so it should be something you’re comfortable leading.
You can either work on projects alone or partner up with other scholars.
Owain Evans
Research Associate, Oxford University
Owain researches situational awareness in LLMs, predicting the emergence of dangerous capabilities, and enhancing human abilities to interact with and control AI through understanding of honesty and deception.
-
Owain is currently focused on:
Defining and evaluating situational awareness in LLMs (relevant paper)
How to predict the emergence of other dangerous capabilities empirically (see “out-of-context” reasoning and the Reversal Curse)
Honesty, lying, truthfulness and introspection in LLMs
He leads a research group in Berkeley and has mentored 25+ alignment researchers in the past, primarily at Oxford’s Future of Humanity Institute.
-
Defining and evaluating situational awareness in LLMs (relevant paper)
Predicting the emergence of other dangerous capabilities in LLMs (e.g. deception, agency, misaligned goals)
Studying emergent reasoning at training time (“out-of-context” reasoning). See Reversal Curse.
Detecting deception and dishonesty in LLMs using black-box methods
Enhancing human epistemic abilities using LLMs (e.g., Autocast, TruthfulQA)
-
Some of Owain's projects involve running experiments on large language models. For these projects, scholars need to have some kind of experience running machine learning experiments (either with LLMs or some other kind of machine learning model).