Evaluations

Many stories of AI accident and misuse involve potentially dangerous capabilities, such as sophisticated deception and situational awareness, that have not yet been demonstrated in AI. Can we evaluate such capabilities in existing AI systems to form a foundation for policy and further technical work?

Mentors

Evan Hubinger
Research Scientist, Anthropic

Evan is open to a range of projects from empirical to theoretical alignment research, specifically interested in deceptive alignment, predictive model conditioning, and situational awareness in LLMs.

Marius Hobbhahn
CEO, Apollo Research

Marius is working on quantifying AI evaluations and understanding model goals through behavioral analysis, aiming to refine AI auditing and oversight methods.

  • Marius Hobbhahn is currently building up a new technical AI safety organization, Apollo Research. His organization’s research agenda revolves around deception, interpretability, model evaluations and auditing. Before founding this organization, he was a Research Fellow at Epoch and a MATS Scholar while he pursued his PhD in Machine Learning (currently on pause) at International Max-Planck Research School Tübingen.

  • Here are suggestions for potential research projects supervised by Marius:

    • Quantifying evaluations: For evals to be maximally useful, we would like to make quantitative statements about their predictions, e.g. “it takes X effort to achieve Y capability”. This project would first investigate different ways of quantifying evals, e.g. FLOP, hours invested, money invested, and more (maybe 2 weeks). Then, the scholar would empirically test the robustness of these ideas, e.g. by identifying empirical scaling laws. The project allows for independence and gives the scholar the opportunity to decide on research directions.

    • Goal evaluation and identification in NNs: For alignment, it is very important to know whether models are goal-directed and, if yes, what goals they have. A deep understanding of goals might only be attainable through interpretability but behavioral measures might bring us quite far. This project aims to build simple behavioral evaluations of goal-directedness. The early parts of the project are well-scoped. After the initial phase is done, the scholar has more freedom to decide on research directions.

    • Measuring the quality of eval datasets: We want to understand how we can measure the quality of eval datasets. In the beginning, we would investigate if there are systematic biases between human-written and model-written evals on the Anthropic model-written evals dataset. Apollo has already invested ~2 weeks into this project and there are interesting early findings. This project is very well-scoped and can be done with very little research experience.

    If you have a high-quality research proposal, Marius might also be willing to supervise that. It’s also possible and encouraged to team up with other scholars to work on the same project. For more background and context on these research projects reading our post on Science of Evals might be helpful.

  • Mentorship:

    • Marius offers between 1-3h of calls per week (depending on progress) with asynchronous slack messages.

    • The goal of the project is to write a blog post or other form of publication.

    You might be interested in that project if:

    • You have some basic research experience and enthusiasm for research.

    • You have enough software experience that you can write the code for this project. 1000+ hours of Python might be a realistic heuristic.

    • You want to take ownership of a project. Marius will provide mentorship but you will spend most time on it, so it should be something you’re comfortable leading.

    You can either work on projects alone or partner up with other scholars.

Owain Evans
Research Associate, Oxford University

Owain researches situational awareness in LLMs, predicting the emergence of dangerous capabilities, and enhancing human abilities to interact with and control AI through understanding of honesty and deception.

  • Owain is currently focused on:

    • Defining and evaluating situational awareness in LLMs (relevant paper)

    • How to predict the emergence of other dangerous capabilities empirically (see “out-of-context” reasoning and the Reversal Curse)

    • Honesty, lying, truthfulness and introspection in LLMs

    He leads a research group in Berkeley and has mentored 25+ alignment researchers in the past, primarily at Oxford’s Future of Humanity Institute.

    • Defining and evaluating situational awareness in LLMs (relevant paper)

    • Predicting the emergence of other dangerous capabilities in LLMs (e.g. deception, agency, misaligned goals)

    • Studying emergent reasoning at training time (“out-of-context” reasoning). See Reversal Curse.

    • Detecting deception and dishonesty in LLMs using black-box methods

    • Enhancing human epistemic abilities using LLMs (e.g., Autocast, TruthfulQA)

  • Some of Owain's projects involve running experiments on large language models. For these projects, scholars need to have some kind of experience running machine learning experiments (either with LLMs or some other kind of machine learning model).