OpenAI Safety Team

Projects on this stream cluster into a few broad areas from the empirical track: scalable oversight, AI control, monitorability and interpretability, adversarial robustness, and security. 

Most fellows will work closely with one or two mentors on something that fits into the mentors' ongoing research. The list of mentors is tentative.

Stream overview

Projects on this stream cluster into a few broad areas from the empirical track: scalable oversight, AI control, monitorability and interpretability, adversarial robustness, and security. Narrower threads include personas and character training, reward hacking, model spec, automated AI research, and safety eval infrastructure. The list below pulls from what individual mentors on this stream are actively working on, and you can see more of our team's published work at alignment.openai.com. Most fellows will work closely with one or two mentors on something that fits into those mentors' ongoing research.

Some example projects:

  • Oversight for autonomous coding agents, e.g. an evaluator that approves or denies a sandboxed agent's permission escalation requests (auto-review), agent and overseer dynamics such as collusion or post-denial behavior, and comparisons of automated and human oversight.
  • CoT and activation monitorability, representation-based interpretability, and privacy-preserving embeddings or monitorability for long-context models.
  • Control and anti-scheming work, including control red-teaming and CoT monitoring studies that use production data to build evals.
  • Adversarial robustness and automated red-teaming, including universal jailbreaks against held-out defenders, prompt injection, training via games, and stress-testing abuse defenses like burner-account and KYC checks.
  • Data poisoning and synthetic data work tied to pretraining and alignment.
  • Personality and character training and evaluation.
  • Safety training infrastructure, grader and eval tooling, and Codex or agent workflows that productionize safety research.
  • More conceptual work on the model spec, automated AI research, and concentration of power.

Mentors

Gabe Wu
OpenAI, Member of Technical Staff (Alignment Research)
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Gabriel Wu is an AI alignment researcher at OpenAI. Previously, he directed the AI Safety Student Team at Harvard, where he earned a Master's degree in Computer Science and a bachelor's degree in Mathematics.

Dylan Sam
OpenAI, Member of Technical Staff
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Dylan is a safety researcher at OpenAI, where he works on curating better/safer training data and monitoring models for harmful behavior.

Before that he completed a PhD in the Machine Learning Department at CMU.

Kaiwen Wang
OpenAI, Researcher (Safety RL)
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Kaiwen is a researcher at OpenAI working on AI Safety and RL. He earned his Ph.D. from Cornell Tech, where he researched and taught RL, causal inference, and LLMs.

He previously worked at Google, Microsoft, and Netflix on projects ranging from core RL theory to scalable LLM algorithms. Before grad school, Kaiwen spent two years at Facebook building the RL Platform.

Tom Dupre la Tour
OpenAI, Research Scientist (Interpretability)
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Tom is a research scientist at OpenAI, working on interpretability of language models for AI safety. He was also a core developer of scikit-learn between 2015 and 2022.

Isak Czeresnia Etinger
OpenAI, Member of Technical Staff
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Isak is a Member of Technical Staff at OpenAI. Previously a Software Engineer at Google, he worked on applications of computer vision, natural language processing, and LLMs.

Isak earned a Master of Computer Science at Carnegie Mellon University, with published work in natural language processing, style transfer, multilingual grapheme-to-phoneme modeling, and computer vision.

Xiangyu Qi
OpenAI, Member of Technical Staff
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Xiangyu is a researcher at OpenAI, where he works to make LLMs robust. Previously, he obtained his Ph.D. from Princeton University, advised by Prof. Prateek Mittal and Prof. Peter Henderson.

Joseph Millman
OpenAI, Member of Technical Staff (Detections and Response)
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Joseph works in Detections and Response at OpenAI. His public security work includes using large language models to detect malicious macOS activity.

Juan Felipe Ceron Uribe
OpenAI, AI Alignment Research Engineer
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Juan is a researcher on OpenAI’s Safety Systems team. He is broadly interested in mitigating catastrophic risks. He works on adversarial robustness training and automated red-teaming (recent work: https://openai.com/index/instruction-hierarchy-challenge/).

Hani Mir
OpenAI, Software Engineer
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Hani is a software engineer at OpenAI.

Christopher Choquette Choo
OpenAI, Research Scientist
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Christopher is a Research Scientist on the Alignment team at OpenAI working on fundamental and applied research. He focuses on privacy-preserving and adversarial machine learning, including memorization, privacy, and security harms in language modeling, and on auditing for and mitigating these risks.

Previously, he was a Research Scientist at Google DeepMind and Google Brain on the Privacy and Security Research team. There, he led privacy and security evals for their frontier model efforts.

James Campbell
OpenAI, Member of Technical Staff
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

James is a Researcher at OpenAI working on model personality, post-training, and personalization.

Jason Wolfe
OpenAI, Member of Technical Staff
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Jason is a Member of Technical Staff at OpenAI working on alignment and model behavior. 

Ollie Matthews
OpenAI, Member of Technical Staff (Alignment Team)
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Ollie is a researcher on OpenAI’s Alignment team interested in red-teaming and control. He was previously on the Control team at UK AISI. 

Micah Carroll
OpenAI, Member of Technical Staff (Safety Oversight Research)
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Micah is a researcher on OpenAI’s safety team interested in AI deception, scalable oversight, and monitorability. He is on leave from a UC Berkeley PhD focused on AI alignment with influenceable humans, AI manipulation from RL training, and recommender-system effects.

Bijan Varjavand
OpenAI, Technical Program Manager
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Bijan is a Technical Program Manager at OpenAI. He previously worked as a research engineer at Scale AI, where he coauthored work on LLM jailbreaking and red-teaming workflows.

Maja Trebacz
OpenAI, Member of Technical Staff
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Maja is a researcher at OpenAI, working on techniques for improving control and alignment as AI systems become more capable and agentic. Her team’s work combines longer-horizon research with hands-on deployment. They study long-term questions about how increasingly intelligent systems can be supervised, constrained, and corrected, while also building oversight systems that are used in practice today, both internally and externally (see recent work on code review and action monitoring for Codex).

Sam Arnesen
OpenAI, Member of Technical Staff
SF Bay Area
Scalable Oversight
Control
Monitoring
Interpretability
Adversarial Robustness

Sam is a Research Engineer on OpenAI’s Alignment team. He previously worked in NYU’s Alignment Research Group on scalable oversight and as a Software Engineer at Amazon. His research includes training language models to win debates with self-play, and recent OpenAI work on auto-review for agent actions.

Tomek Korbak
OpenAI, Member of Technical Staff
SF Bay Area
Control
Monitoring

Tomek is a Member of Technical Staff at OpenAI working on monitoring LLM agents for misalignment. Previously, he worked on AI control and safety cases at the UK AI Security Institute and on honesty post-training at Anthropic. Before that, he did a PhD at the University of Sussex with Chris Buckley and Anil Seth focusing on RL from human feedback (RLHF) and spent time as a visiting researcher at NYU working with Ethan Perez, Sam Bowman, and Kyunghyun Cho.

Mentorship style

Fellows we are looking for

Essential:

  • Strong research ability, technical judgment, and the capacity to execute a substantial project (paper, benchmark, dataset, or production-quality tooling) over the course of the fellowship.
  • Genuine interest in AI safety and in at least one of the project areas above. Fellows are matched to mentors based on fit, so concrete overlap with one or two mentors' interests matters more than broad interest.

Preferred (at least one of):

  • Empirical ML research experience with large language models, and strong Python and software engineering skills.
  • Prior research or applied work in one of the project areas above.
  • For more conceptual work on model spec, automated AI research, and concentration of power, strong writing and conceptual analysis skills instead of empirical ML.

Project selection