AI Evaluations

Many stories of AI accident and misuse involve potentially dangerous capabilities, such as sophisticated deception and situational awareness, that have not yet been demonstrated in AI. Can we evaluate such capabilities in existing AI systems to form a foundation for policy and further technical work?

Mentors

Evan Hubinger
Research Scientist, Anthropic

Evan is open to a range of projects from empirical to theoretical alignment research, specifically interested in deceptive alignment, predictive model conditioning, and situational awareness in LLMs.

Evan Hubinger is a research scientist at Anthropic where he leads the Alignment Stress-Testing team. Before joining Anthropic, Evan was a research fellow at the Machine Intelligence Research Institute. Evan has done both very empirical alignment research, such as “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,” and very theoretical alignment research, such as "Risks from Learned Optimization in Advanced Machine Learning Systems.”
I am open to very empirical projects, very theoretical projects, and anything in between. Some particular areas of interest:
- Model organisms of misalignment: Anything in the agenda layed out in “Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research.”
- Conditioning predictive models: Anything in the agenda layed out in “Conditioning Predictive Models: Risks and Strategies.”
- Deceptive alignment: Anything related to deceptive instrumental pseudo-alignment as defined in "Risks from Learned Optimization in Advanced Machine Learning Systems.”
I tend to be pretty low-touch, so working with me will require you to be able to execute on projects fairly independently.
Be prepared for me to push you to make sure you have a clear story for how your work actually makes AI more likely to go well.

Marius Hobbhahn
CEO, Apollo Research

Marius works on scheming / deceptive alignment and does projects related to scheming, science of evals and control.

Marius Hobbhahn is the CEO and co-founder of Apollo Research. Apollo focuses on deception, interpretability, model evaluations and governance. Prior to that, he was a Research Fellow at Epoch and independent researcher while he pursued his PhD in Machine Learning at the International Max-Planck Research School Tübingen.
Marius' stream focuses on scheming, control and science of evals. His projects will be a scheming-related demo, a blue-team red-team game for control evals and designing honeypots.
Here are suggestions for potential research projects supervised by Marius:
- Builder-Breaker process for control: We play a blue-team red-team game for control evaluations. The blue team proposes a safety measure and argues why it's safe and the red-team designs an attack to break the setting. The goal is to figure out the basics for how to design such a process well. We will first do a few iterations only on paper and then actually play it with a real setup, code, etc. You might be a good fit for this project if you enjoy playing builder-breaker games and enjoy both light conceptual as well as empirical work. You can find more details here.
- Designing honeypots: To test whether models are well-aligned, we want to design adversarial alignment evaluations (aka honeypots) where the evaluators create a scenario that looks realistic and provides an easy opportunity for a model to perform a bad action if it was misaligned, i.e. a bait. Then, we test whether the model takes the bait. Designing honeypots is hard because we have to find situations that the model thinks are real. This project will try to design some high-quality honeypots and figure out the basic science of building them. You might be a good fit if you enjoy designing evals and tinkering with LM agents. You can find more details here.
- scheming-related demo: Design a setting that is as close to a real deployment setting, e.g. an automated AI scientist doing ML research, in which the model shows aspects of scheming without being explicitly instructed to do so. For an intuition what this could look like, see section "Safe Code Execution" in https://arxiv.org/abs/2408.06292. Though this demonstration would aim to be more reproducible and in a more controlled environment. This project will include a lot of LM agent scaffolding, prompting and potentially fine-tuning. You might be a good fit for this project if you enjoy tinkering with LM agents. You can find more details here.
If you have a high-quality research proposal, Marius might be willing to supervise that. However, the bar for this is extremely high, so in expectation, all scholars will work on one of the three projects above. For more background and context, you might want to check out our work on scheming safety cases (hopefully public by the time you read this): https://www.apolloresearch.ai/research
Mentorship:
- Marius offers between 1-3h of calls per week (depending on progress) with asynchronous slack messages.
- The goal of the project is to write a blog post or other form of publication.
You might be interested in that project if:
- You have some basic research experience and enthusiasm for research.
- You have enough software experience that you can write the code for this project. 1000+ hours of Python might be a realistic heuristic.
- You want to take ownership of a project. Marius will provide mentorship but you will spend most time on it, so it should be something you’re comfortable leading.
You can either work on projects alone or partner up with other scholars.

Francis Rhys Ward
PhD Student, Imperial College London

Rhys works on projects ranging from philosophy, to theoretical and empirical science, and technical AI governance. In this MATS cohort, he’s looking to work on evals and/or model organisms research.

I will soon finish my PhD, on formalising and evaluating AI deception.
Technically, my work involves both conceptual research, in the intersection of game theory, causality, and philosophy, in addition to empirical evaluations of frontier AI systems.
I am a member of Tom Everitt's Causal Incentives Working Group. Previously I have worked at the Centre for Assuring Autonomy, the Center on Long-Term Risk, the Centre for the Governance of AI, and the UK’s AI Safety Institute.
Most likely we will work on sandbagging related projects, including, for instance, evaluations of sandbagging related capabilities or model organisms of strategically hidden capabilities.
I'm pretty flexible but here are some ideas:
1. Sandbagging evals. Build a dangerous capability eval to measure capabilities related to sandbagging in LM agents, e.g., emulating the performance of a weaker agent on a coding eval.
2. Model organisms of goal misgeneralisation. For example,
3. When would an LM agent, fine-tuned with RL to use tools and achieve some goal over short episodes, learn a goal which generalised to longer episode. (Cf. Carlsmith's scheming report)?
4. Similarly, when would agents trained in settings which incentivise instrumental resource-acquisition learn to seek resources at larger scales in the test setting, even if this is no longer beneficial for the intended goal?
- Good communication -- it's important that you keep me updated with how you're doing and how I can help!
- Experience with LM experiments (e.g., fine-tuning).
- Clear thinker and writer.

Owain Evans
Director, Truthful AI; Affiliate Researcher, UC Berkeley CHAI

Owain researches situational awareness in LLMs, predicting the emergence of dangerous capabilities, and enhancing human abilities to interact with and control AI through understanding of honesty and deception.

Owain is currently focused on:
- Defining and evaluating situational awareness in LLMs (relevant paper)
- How to predict the emergence of other dangerous capabilities empirically (see “out-of-context” reasoning and the Reversal Curse)
- Honesty, lying, truthfulness and introspection in LLMs
He leads a research group in Berkeley and has mentored over 35 alignment researchers in the past, both at MATS and Oxford’s Future of Humanity Institute.
Goal: develop a scientific understanding of AI capabilities related to risk and mis-alignment, including situational awareness, hidden reasoning, and deception.
I have mentored 35+ researchers in AI safety, and past MATS projected have resulted in papers "The Reversal Curse" and "Me, Myself, and AI".
- Defining and evaluating situational awareness in LLMs (relevant paper)
- Predicting the emergence of other dangerous capabilities in LLMs (e.g. deception, agency, misaligned goals)
- Studying emergent reasoning at training time (“out-of-context” reasoning). See also Reversal Curse.
- Detecting deception and dishonesty in LLMs using black-box methods
- Enhancing human epistemic abilities using LLMs (e.g., Autocast, TruthfulQA)
- Experience running ML experiments and working on ML research papers

Oliver Sourbut
Technical Staff, UK AISI

Oly works on autonomous systems at UK AISI, where he delivers threat modelling, evals, and some thought about what might need to come next for agent oversight and AI governance.

Oly has worked on the 'full stack' of autonomy evals at the UK AI Safety Institute: threat modelling, operationalising threats into evals, developing and running evals, agent elicitation, and feeding into technical and policy reports. Besides, he's very interested in hearing ideas towards answering, "what's next after evals?", whether it's mechinterp, oversight/control mechanisms, societal mitigations and hardening, alignment, or other means and cases for safety.
He's still technically doing an AI safety PhD at Oxford, in (LM) agent oversight and multiagent safety. In that capacity he engaged with policy folks at the UK FCDO and the OECD on AI safety between 2022-24. Before that he was a mathsy senior data scientist and software engineer, and one of the first beneficiaries of the MATS programme, in the 21-22 cohort.
This stream's mentors will be Oly Sourbut and Sid Black, with Oly Sourbut being the primary mentor.

We want scholars to build evaluations that are directly applicable to common threat models for AI. Scholars can either do more threat modelling work, or build evals themselves. We are especially excited by whitebox and interpretability evals, as well as infrastructure-heavy work (e.g. kubernetes & Docker), and evaluations targeting in-context exploration and error recovery.
Oly and Sid are working on dangerous capability evaluations for autonomous systems, in particular in the domains of AI R&D, general SWE skills, deception and persuasion, and autonomous replication and adaptation. They are also working on risk modelling for autonomous systems. They'd be interested in supervising related projects.
- Experienced software engineers if you're building evals
- Unusual backgrounds and domain expertise relevant to specific threats (e.g. dark web, networking, deception e.g. sociology skills) encouraged for threat modelling

Sid Black
Research Scientist, UK AISI

Sid Black is a Research Scientist at UK's AI Safety Institute (previously at EleutherAI and Conjecture), his current research interests include understanding and quantifying autonomy risks and threat models, increasing resilience to AI risks, white-box and predictive evaluations for AI agents, and developing standards for AI agent evaluations.

Sid Black is a Research Scientist at UK's AI Safety Institute building autonomy evaluations. He previously co-founded EleutherAI and Conjecture, where his work focused on pretraining and evaluating large language models, interpretability, and programming agents.
His current research interests include:
- understanding and quantifying autonomy risks from AI agents.
- Identifying relevant autonomy threat models.
- Building new evaluations which increase our coverage of these threat models.
- Researching ways to increase resilience to AI risks, in the case of unknown or unpredictable threat models.
- Developing standards for, and the science of, AI agent evaluations.
- Predictive evaluations and White-box evaluations for AI agents.
This stream's mentors will be Oly Sourbut and Sid Black, with Oly Sourbut being the primary mentor.

We want scholars to build evaluations that are directly applicable to common threat models for AI. Scholars can either do more threat modelling work, or build evals themselves. We are especially excited by whitebox and interpretability evals, as well as infrastructure-heavy work (e.g. kubernetes & Docker), and evaluations targeting in-context
Oly and Sid are working on dangerous capability evaluations for autonomous systems, in particular in the domains of AI R&D, general SWE skills, deception and persuasion, and autonomous replication and adaptation. They are also working on risk modelling for autonomous systems. They'd be interested in supervising related projects.
- Experienced software engineers if you're building evals
- Unusual backgrounds and domain expertise relevant to specific threats (e.g. dark web, networking, deception e.g. sociology skills) encouraged for threat modelling

Stephen Casper
PhD Student, MIT AAG

Cas's research focuses on red-teaming, audits, and exploring sociotechnical challenges posed by advanced AI.

Stephen (“Cas”) Casper is a Ph.D student at MIT in the Algorithmic Alignment Group advised by Dylan Hadfield-Menell. Most of his research involves red-teaming, auditing, and sociotechnical safety but he sometimes works on other types of projects too. In the past, he has worked closely with over a dozen mentees on various safety-related research projects.
Red-teaming, audits/evals, and sociotechnical AI safety.
- Some example work from Cas:
  Specific research topics for research with MATS can be flexible depending on interest and fit. However, work will most likely involve one of two topics (or similar):
  - Evaluating AI systems under latent-space and weight-space attacks. Currently, I am working on understanding how well latent- and weight-space attacks on AI systems can help us predict/upper-bound vulnerability to input-space attacks. Our goal is to help auditors assess an AI system's black-box capabilities in a way that is less vulnerable to overlooking threats from unforeseen adversarial prompts. In addition to our current work, there will be a lot of opportunities to expand and continue this line of work.
  - Sociotechnical AI impacts and impact evaluations. It's very challenging to understand the broader societal impacts of AI systems. In particular, there are several pernicious ways in which Goodhart's law can apply to sociotechnical AI impacts: (1) group A over group B, (2) developers over society, and (3) easily measurable impacts over hard-to-measure impacts. I'm currently doing some work to test the hypothesis that more 'aligned' language models express more culturally narrow viewpoints. I'm also interested in more related work along these lines.
  - Effective technical AI governance. Broadly, I am interested in technical work that can help with AI governance problems or inform the design of emerging AI governance frameworks.
- Significant research experience
- Enthusiasm about sociotechnical AI safety

Steven Basart
Research Manager, Center for AI Safety

Steven Basart is interested in supervising projects that have to do with expanding on representation engineering and expanding on the WMDP benchmark.

Steven is a Research and Reliability Engineer at the Center for AI Safety (CAIS), where he focuses on data evaluation and infrastructure development for AI safety projects. Before joining CAIS, he was a Software Engineer II at SpaceX, enhancing Kubernetes environments for satellite operations and supporting the StarLink mobility project. Steven also directed AI infrastructure at Autobon AI, managing data systems integration into AWS. His early research included internships at Google Brain, addressing fact-checking and content abuse, and at HERE Maps, developing predictive models. For more information, visit his website.
You will work closely with CAIS on one of our research projects. Rarely we do help support independent research projects, but we provide less internal support for this and it'll mostly be me supporting the scholar.
- Robustness - As AI agents become increasingly powerful, image hijacks or jailbreaks can lead to loss of control over powerful AI agents, and eventually to catastrophic outcomes. Adversarial robustness has historically been incredibly challenging, with researchers still unable to train adversarially robust MNIST classifiers. Despite this, we’ve developed a novel defense for LLMs and Multi-Modal Models, which so far is the most successful adversarial robustness technique. Preliminary experiments indicate that our defense is robust to jailbreaks and image hijacks of arbitrary strength in a highly reliable fashion. As such, it shows the potential to greatly reduce the risk of AIs aiding malicious users in building bioweapons or loss-of-control over powerful AI agents through hijacking.
- Representation Engineering - Our sense is that there’s a lot of low-hanging fruit and tractable progress on Representation Engineering. Example projects include improving the controllability of AI systems as well as better lie detection methods.
- Virology Evals - Ensuring that AI systems cannot create bioweapons involves measuring and removing hazardous knowledge from AI systems. Knowledge can be broken down into theoretical knowledge (Episteme) as well as tacit ability or skill (Techne). WMDP provided a way to measure and remove the theoretical knowledge needed to develop bioweapons. To fully address the problem, we need to develop measures and removal techniques for the tacit ability or skills necessary to develop bioweapons.
We are in search of research scholars who have demonstrated excellence in empirical work. A key criterion is having co-authored a paper presented at a top machine learning conference, such as ICML, ICLR, NeurIPS, CVPR, ECCV, ICCV, ACL, NAACL, or EMNLP. This requirement ensures that candidates have a solid foundation in conducting impactful research. We prefer individuals who are more inclined towards practical application rather than purely theoretical work. The ideal candidates are those who are not only skilled in their field but are also willing to tackle hands-on challenges in the pursuit of advancing our understanding and capabilities in AI.

AI Evaluations

Mentors

Evan HubingerResearch Scientist, Anthropic

Marius HobbhahnCEO, Apollo Research

Francis Rhys WardPhD Student, Imperial College London

Owain EvansDirector, Truthful AI; Affiliate Researcher, UC Berkeley CHAI

Oliver SourbutTechnical Staff, UK AISI

Sid BlackResearch Scientist, UK AISI

Stephen CasperPhD Student, MIT AAG

Steven BasartResearch Manager, Center for AI Safety