Scott Emmons

The stream will advance empirical methodologies for third-party AI safety evaluations. Example research topics include chain-of-thought monitorability, the secret loyalties research agenda, and automatic auditing (eg, with Anthropic’s Parallel Exploration Tool for Risky Interactions).

Stream overview

The stream will advance empirical methodologies for third-party AI safety evaluations. Example research topics include chain-of-thought monitorability, the secret loyalties research agenda, and automatic auditing (eg, with Anthropic’s Parallel Exploration Tool for Risky Interactions).

Mentors

SF Bay Area
Control
Dangerous Capability Evals
Red-Teaming
Monitoring
Model Organisms

I research AI safety and alignment. Most recently, I was a research scientist at Google DeepMind. I completed my PhD at UC Berkeley's Center for Human-Compatible AI, advised by Stuart Russell. I previously cofounded FAR.AI, a 501(c)3 research nonprofit that incubates and accelerates beneficial AI research agendas.

I develop AI alignment frameworks, stress-test their limits, and turn insights into methodology adopted across the field. I have established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment, designed practical metrics to preserve monitorability during model development, shown that obfuscated activations can bypass latent-space defenses, and developed StrongREJECT, a jailbreak benchmark now used by OpenAI, US/UK AISI, Amazon, and others.

Read more

Mentorship style

I will have scholars work in teams. During the week, the scholars will collaborate with each other and are encouraged to meet frequently. I will hold a weekly advising meeting for each project to provide help and guidance.

Representative papers

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

  • Establishes that—despite unfaithfulness concerns from prior work—CoT monitoring adds a substantial layer of defense when CoT is necessary for a model to carry out bad behavior.
  • Recommended by METR.

A Pragmatic Way to Measure Chain-of-Thought Monitorability

  • Designs metrics to track legibility and coverage, two key components of monitorability. Releases an autorater prompt developers can use to track how design decisions impact monitorability.

Obfuscated Activations Bypass LLM Latent-Space Defenses

A StrongREJECT for Empty Jailbreaks

AI-Enabled Coups: How a Small Group Could Use AI to Seize Power

  • Presents the secret loyalties research agenda.

Petri: An open-source auditing tool to accelerate AI safety research

  • An automatic auditing tool released by Anthropic.

Scholars we are looking for

My projects will be prioritizing impact on the field of AI safety over academic novelty. Beyond the skills of doing empirical AI safety research, I am looking for collaborators who are excited about doing sound and impactful science - including the mundane aspects of doing good science.

Probably will work with collaborators from stream

Project selection

I will provide a list of projects, but can also talk through other project ideas