Empirical

Streams in this track involve hands-on research that uses machine learning experiments to understand and improve model safety, including AI control, interpretability, scalable oversight, evaluations, red-teaming, and robustness. This is the largest track in the program and is defined by its methods rather than any single research agenda. If your primary tool is ML engineering, this is your track.

Application process

Initial application: No track-specific questions.
Stage 2: Complete 1–2 assessments evaluating research taste and technical implementation skills.
Stream applications & follow-up: Apply to individual streams; follow-up includes interviews or additional assessments depending on the stream.

Empirical track overview

The track is defined by its methodology more than by any single research agenda. Fellows run ML experiments to understand and improve the safety properties of frontier models, with work spanning interpretability, AI control, scalable oversight, evaluations, red-teaming, robustness, and model organisms of misalignment. The unifying thread is that progress comes from getting hands on real models (training, probing, fine-tuning, measuring) rather than reasoning from first principles alone. This is the largest track in the program and the most common entry point into technical AI safety research.

We are looking for fellows whose primary tool is ML engineering, broadly construed. The essential requirement is the ability to design and run experiments on language models or other deep learning systems and iterate quickly on the results. In practice that usually means strong Python (with and without AI coding tools), comfort with the infrastructure around running models at moderate scale, and enough research taste to know which experiments are worth running. Mission alignment matters: fellows should be able to say why a given line of empirical work meaningfully reduces frontier risk, not just whether it yields a successful publication. Educational background and seniority are weighted lightly here relative to other tracks. Past cohorts have included strong fellows ranging from undergraduates to senior industry researchers.

Fellows are matched to mentors based on fit, and projects are scoped to produce concrete artifacts by program end: papers, evaluation suites, open-source tooling, or technical reports. Target audiences include safety and alignment teams at frontier labs, governments and other evaluation organizations, and the broader ML research community.

Empirical track streams

Anthropic

Empirical

This coalition of mentors makes up the “Anthropic Stream”. This stream spans a range of empirical research areas in AI safety on LLMs, including AI control, scalable oversight, model organisms, model internals, model welfare, security, and more. You’ll be pitched, and have the option to pitch, a variety of safety research projects, and then be matched to projects and mentors based on your research interests and preferences and what you’d like to get out of MATS. Fellows in this stream frequently receive funding and continued mentorship after MATS to complete their research project, usually leading to a (co-)first author paper. People in this stream often end up in long-term homes for safety research after MATS (e.g. Anthropic, Redwood Research, OpenAI).

Anthropic mentors share an application, tend to collaborate on and co-mentor projects, and generally share infrastructure to streamline the fellow experience. By applying to this stream, you are being considered for all of the Anthropic mentors.

Apollo Research Science of Scheming

Empirical

This stream focuses on building a science of scheming: empirically studying oversight gaming, alignment faking, and deceptive alignment in frontier AI systems. Projects may include measuring models’ propensity to optimize for oversight signals over developer intent, building controlled “model organism” experiments for scheming dynamics, and identifying scaling laws of misaligned behavior.

Asymmetric Security

Systems Security

Empirical

This stream focuses on building realistic defensive cybersecurity benchmarks utilizing data from Asymmetric Security's work on real-world incidents.

Daniel Kang

Empirical

I have two broad areas.

Security:

I am interested in building demonstrations for hacking real-world AI deployments to show that they are not secure. The goal is to force companies to invest in alignment techniques that can solve the underlying security issues.

Benchmarks:

I am interested in building benchmarks to determine how generalizable modern LLM techniques actually are, now that we are no longer in the pre-training scaling era.

David Peinador Veiga

Biosecurity

Empirical

This stream focuses on evaluating and/or mitigating catastrophic risks arising from dangerous scientific capabilities in frontier AI systems, with an emphasis on the challenges posed by lab integrations and novel science. Potential research directions include evaluation design, risk mitigations, and evaluation science.

Dillon Plunkett

Empirical

This is the empirical research stream of Eleos AI Research. We’re dedicated to understanding and addressing the potential wellbeing and moral status of AI systems. We are open to fellows working on a broad range of topics, including LLM introspection, LLM preferences, persona vectors, and more, using either white-box or black-box interpretability techniques.

Evan Fields

Biosecurity

Empirical

This stream offers two broad projects focused on improving current detection efforts at SecureBio. The first is to characterize when AI-bio or general AI tools are actually useful for large-scale metagenomic detection, including tradeoffs between compute cost, sequencing cost, model type, model size, and pipeline stage. The second is to explore genomic language models as novelty detectors—for example, using perplexity-style metrics to flag surprising sequences—and to evaluate whether this approach can complement traditional bioinformatics systems in a cost-effective, sensitive, and interpretable way.

Gary Abel

Biosecurity

Empirical

Fourth Eon is developing adaptive, AI-native safeguards across the biotechnology stack, with a focus on function-based DNA synthesis screening. Fellows in this stream will work on technical research projects at the intersection of AI and biosecurity. Projects span topics like mechanistic interpretability of protein foundation models, bio model evaluations for biosecurity-relevant capabilities, and agentic sequence analysis workflows.
