Provably safe AI
If we could encode a detailed world model and coarse human preferences in a formal language, it could be possible to formally verify that an AI-generated agent won’t take actions leading to catastrophe. Can we use frontier AI to help create such a detailed multi-scale world model, and/or to synthesize agents and proof certificates for small-scale demonstrations already?
Mentors
Yoshua Bengio
Founder, Mila; Professor, Université de Montréal
Yoshua is recognized worldwide as one of the leading experts in artificial intelligence and is most known for his pioneering work in deep learning, earning him the 2018 A.M. Turing Award, “the Nobel Prize of Computing,” with Geoffrey Hinton and Yann LeCun.
-
Recognized worldwide as one of the leading experts in artificial intelligence, Yoshua Bengio is most known for his pioneering work in deep learning, earning him the 2018 A.M. Turing Award, “the Nobel Prize of Computing,” with Geoffrey Hinton and Yann LeCun.
He is Full Professor at Université de Montréal, and the Founder and Scientific Director of Mila – Quebec AI Institute. He co-directs the CIFAR Learning in Machines & Brains program as Senior Fellow and acts as Scientific Director of IVADO.
In 2019, he was awarded the prestigious Killam Prize and in 2022, became the most cited computer scientist in the world. He is a Fellow of both the Royal Society of London and Canada, Knight of the Legion of Honor of France, Officer of the Order of Canada, Member of the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology since 2023 and a Canada CIFAR AI Chair.
Concerned about the social impact of AI, he actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence.
-
TBD
-
TBD
Matt MacDermott
PhD Student, Imperial College London
Matt MacDermott is a PhD student at Imperial College London, working with the Causal Incentives Working Group. He is currently interning at Mila working on Yoshua Bengio's proposal for cautious scientist AI.
-
Matt MacDermott is a PhD student at Imperial College London, working with the Causal Incentives Working Group. He is currently interning at Mila working on Yoshua Bengio's proposal for cautious scientist AI.
-
TBD
-
TBD
Jason Gross
Technical Lead, Special Project funded by ARC Theory
Jason Gross’s research aims to formalize mechanistic explanations of neural-net behavior, using formal proofs and automated evaluations to make AI systems more transparent, reliable, and accountable.
-
Jason is a programming languages scientist. Previously, he worked on securing the internet and fighting with proof-assistants. Now, he leads a team working on formalizing mechanistic explanations of neural-net behavior. He is interested in both theory-heavy projects geared at guarantees and automated evaluation, and applied projects that use guarantees-compression to play with and understand models.
-
Summary of the proofs approach:
Motivation: To really discover what’s going on in large systems with emergent capabilities, we will develop automatically-generated explanations, including by delegating to AI systems. To make use of the automatically-generated explanations, we will need automatic evaluations/checking that cannot be Goodharted.
Premise: Explanations have two important properties: correspondence to the underlying system, and compression of the underlying system. Given a formal specification, correspondence is easy to achieve; however, without adequate compression we may never finish checking the explanation. (A toy illustration of the compression point follows this summary.)
Question: How much can we compress the sprawling small noise in current model architectures? Empirically, noise seems unimpactful. But if we can’t compress it, maybe there’s an adversary hiding.
Perspective: Can we use compression-based metrics to develop feedback loops for SOTA approaches like SAEs? The success of a technique is measured by what it can explain, and we have a repertoire of tools for analyzing precisely what has been explained.
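As a toy illustration of the compression point (the task, numbers, and case split here are my own illustrative assumptions, not project results): for max-of-K over a vocabulary of size V, naive verification of an explanation touches every input, while a case analysis keyed on a small summary of the input touches far fewer cases, provided the explanation actually justifies ignoring everything else.
```python
# Toy illustration (hypothetical numbers): why compression matters when
# checking an explanation. For max-of-K over a vocabulary of size V,
# brute-force verification touches every input sequence, while a case
# analysis keyed on a small summary of the input touches far fewer cases.
V, K = 64, 4  # assumed vocabulary size and sequence length

brute_force_cases = V ** K  # one check per input sequence: 64**4 = 16,777,216

# If the explanation licenses a case split on (value of the max, number of
# positions attaining it), the count collapses to V * K = 256 cases.
compressed_cases = V * K

print(f"brute force: {brute_force_cases:,} cases")
print(f"case split:  {compressed_cases:,} cases")
```
The catch is correspondence: one still has to establish that the compressed case split really covers the model’s behavior, which is exactly where the formal specification and proof work come in.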
Researchers are driven by different goals day-to-day: from seeking fundamental understanding of how learning happens, to fixating on the few test cases that a NN fails to find good solutions to. Projects go best when you find something that you are inspired to own. If you’re enthusiastic about the proofs approach, I am happy to work with you to find a project that will set you up to succeed.
Some possible projects:
Train sparse autoencoders (SAEs) on a transformer trained on the algorithmic max-of-K task (see here or here), exploring the hypothesis that SAEs can give optimal case analysis structure for proofs; a minimal training sketch follows the questions below. Questions:
Does using a combinatorial/multiplicative sparsity penalty rather than an additive one give better results?
Where in the network do we want to insert SAEs to recover case analysis structure?
Would it be better to train SAEs that use rectified linear step functions rather than rectified linear unit functions (ReLU) to capture features that are not binary (on-off), such as the size direction in max-of-K?
(More open-ended) what does the proofs frame have to say about SAEs, and what can we learn about compact guarantees by playing with SAEs?
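A minimal sketch of the kind of SAE training setup this first project could start from; the architecture, hyperparameters, and the "multiplicative" penalty variant below are my assumptions, not a prescribed design:
```python
# Minimal sparse-autoencoder sketch for activations of a toy transformer
# (e.g. one trained on max-of-K). All hyperparameters, and the alternative
# sparsity penalty, are illustrative guesses.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU features; a jump/step nonlinearity is one variant to try for
        # non-binary features such as the size direction in max-of-K.
        f = torch.relu(self.enc(x))
        return self.dec(f), f

def sae_loss(x, x_hat, f, sparsity_coeff=1e-3, penalty="additive"):
    recon = (x - x_hat).pow(2).mean()
    if penalty == "additive":
        sparsity = f.abs().sum(dim=-1).mean()  # standard L1 penalty
    else:
        # One guess at a "combinatorial/multiplicative" penalty: the log of a
        # product over features, which grows with how many are active at once.
        sparsity = torch.log1p(f.abs()).sum(dim=-1).mean()
    return recon + sparsity_coeff * sparsity

# acts would be residual-stream (or head-output) activations from the toy model.
acts = torch.randn(1024, 32)  # placeholder data: [batch, d_model]
sae = SparseAutoencoder(d_model=32, d_hidden=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    loss.backward()
    opt.step()
```
Where to insert the SAE (which layer or head output) and how to read case-analysis structure off its features are exactly the open questions above.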
Toy models of structure incrementalism: Vertebrates have a blind spot because the optic nerve punches through the retina before attaching from the front. Evolution has not solved this problem because every incremental step needs to make the organism more fit, and discontinuous jumps won’t happen. Similarly, complex structures won’t arise except as a series of incrementally useful steps. Induction heads should form a toy model of this, where the absence of useful trigram statistics in the training data results in the inability to form the first layer of an induction head. Questions:
Are induced trigram statistics (from bigram statistics / Markov chain generators) sufficient, or do the trigram statistics need to have predictive power even after screening off the value of the current token?
Is there a relationship between the strength/usefulness of the trigram statistics and the speed of induction head formation?
We need to somehow measure the speed of induction head formation
We need to measure the strength of the trigram statistics (one possible measurement is sketched after this list)
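One way to operationalize "strength of trigram statistics" (my framing, not a prescribed metric) is the conditional mutual information between the token two positions back and the next token, given the current token: under a purely first-order (bigram) generator it is approximately zero, and it becomes positive exactly when trigram statistics retain predictive power after screening off the current token. A sketch:
```python
# Sketch (illustrative, not a prescribed measurement): quantify trigram
# strength as I(prev ; next | cur), the conditional mutual information between
# the token two back and the next token given the current token.
import numpy as np
from collections import Counter

def conditional_mi(tokens):
    """Plug-in estimate of I(x_{t-1}; x_{t+1} | x_t) from a token sequence."""
    counts = Counter(zip(tokens[:-2], tokens[1:-1], tokens[2:]))
    total = sum(counts.values())
    p_abc = {k: v / total for k, v in counts.items()}
    p_b, p_ab, p_bc = Counter(), Counter(), Counter()
    for (a, b, c), p in p_abc.items():
        p_b[b] += p; p_ab[a, b] += p; p_bc[b, c] += p
    return sum(p * np.log2(p * p_b[b] / (p_ab[a, b] * p_bc[b, c]))
               for (a, b, c), p in p_abc.items())

rng = np.random.default_rng(0)
V, N = 8, 100_000

# Bigram / first-order Markov generator: next token depends only on current.
T1 = rng.dirichlet(np.ones(V), size=V)
seq = [0]
for _ in range(N):
    seq.append(rng.choice(V, p=T1[seq[-1]]))
print("first-order generator:", conditional_mi(seq))   # ~0 bits

# Second-order generator: genuine trigram structure.
T2 = rng.dirichlet(np.ones(V), size=(V, V))
seq = [0, 0]
for _ in range(N):
    seq.append(rng.choice(V, p=T2[seq[-2], seq[-1]]))
print("second-order generator:", conditional_mi(seq))  # clearly > 0 bits
```
Speed of induction head formation could then be tracked against this quantity over training checkpoints, for example via a prefix-matching score.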
Develop techniques to compress noise in more models (each training objective or model config can be a single “project” — formal proofs are hard!). Questions:
What are the challenges introduced by each new kind of operation (multiple heads, multiple layers, layernorm)?
Is there a uniformity to the kind of model-specific insight required for noise-compression to various asymptotic bounds, even across models?
Develop expository tutorials for the proofs approach to mech interp, for instance a Sorted List (1L, 2H, LN, attn-only) attempt for the ARENA mech interp monthly challenges.
Get LLMs to formalize the proofs in proof assistants like Lean or Coq.
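For the last item, a toy Lean 4 (Mathlib) example of the shape such a formalized statement might take; this is purely illustrative, and the real targets are bounds about concrete model weights rather than abstract functions:
```lean
import Mathlib

-- Illustrative only: if an explanation `g` matches a model `f` to within `ε`
-- on every input, then `f` never exceeds `g` by more than `ε`.
example (f g : ℝ → ℝ) (ε : ℝ) (h : ∀ x, |f x - g x| ≤ ε) (x : ℝ) :
    f x ≤ g x + ε := by
  have hx := (abs_le.mp (h x)).2
  linarith
```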
My projects typically involve a combination of tasks:
Training toy models with PyTorch, sometimes including thinking about and adjusting the data generating procedure or hyperparameters
Exploring small trained models
Constructing and evaluating hypotheses for mechanistic explanations
Writing programs to bound the worst-case divergence of a concrete model from an estimate computed according to a mechanistic explanation (a toy example follows this list)
Iterating on the programs to produce non-vacuous bounds
Analyzing the computational complexity of the worst-case divergence computation
Writing up proof sketches or formal proofs that a given bound computation is correct
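As a deliberately tiny, hypothetical example of the bound-writing step: if an explanation replaces a weight matrix with a simpler approximation, the worst-case divergence over a bounded input set can be certified with an operator-norm inequality, and "iterating to non-vacuous bounds" amounts to tightening such estimates against the structure of the real model.
```python
# Hypothetical miniature of the bound-writing task: the "explanation"
# approximates a linear map W by a rank-1 matrix W_tilde, and we certify the
# worst-case divergence ||W x - W_tilde x|| over all inputs with ||x|| <= R.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))

# Explanation: keep only the top singular direction of W.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_tilde = S[0] * np.outer(U[:, 0], Vt[0])

R = 1.0  # assumed bound on the input norm
# For any ||x|| <= R: ||(W - W_tilde) x|| <= ||W - W_tilde||_op * R, and the
# operator norm of the residual is the second singular value of W.
bound = np.linalg.norm(W - W_tilde, ord=2) * R
print(f"certified worst-case divergence: {bound:.3f} (= sigma_2 * R = {S[1] * R:.3f})")

# Empirical sanity check: sampled inputs never exceed the certificate.
xs = rng.normal(size=(1000, 32))
xs *= R / np.linalg.norm(xs, axis=1, keepdims=True)
empirical_max = np.linalg.norm(xs @ (W - W_tilde).T, axis=1).max()
assert empirical_max <= bound + 1e-9
```
Real models add multiple heads, multiple layers, and layernorm, which is where the per-architecture insight asked about in the questions above comes in.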
-
In addition to the skills and interests you would need for mech interp generally, the following are important for enjoying working on the proofs approach:
You like at least one of the following: complexity theory, analysis, number theory, formal logic, or bashing inequalities. (And if you don’t, you have a long explanation for it.)
You have developed mathematical fluency via research, competition math, or a rigorous undergrad program.
You have patience to wrestle with code. You don’t have to be a top engineer, but you will need the skills that are picked up by contributing to a large software project with many moving parts, or working through the tedious parts of ARENA.
My mom is a pre-K teacher—most of her job is to empower students and nurture their vital curiosity. This is also my mentorship philosophy, and I am excited to work with researchers who are both opinionated and actively seeking feedback through action.
I’ve had success with the following structure for previous fellowship cohorts:
You own a clear, well-defined (component of a) project.
You have scheduled co-working blocks or pairing sessions with collaborators.
Biweekly group meeting where everybody presents interim results.
Weekly 1:1 time of 1h (+2h as needed) to discuss research plan, stare at code, brainstorm, provide feedback on write ups, or provide other general support.
Variable availability via Slack.
Sören Mindermann
Postdoctoral Researcher, Mila
Sören is a postdoctoral researcher in AI safety and alignment at Mila, working with Yoshua Bengio. He recently helped write the International AI Safety Report and assemble its writing team.
-
Sören is a postdoctoral researcher in AI safety and alignment at Mila, working with Yoshua Bengio. He recently helped write the International AI Safety Report and assemble its writing team. He completed his PhD at the University of Oxford under the mentorship of Professor Yarin Gal, working on active learning, causal inference, COVID-19, LLM honesty, and conceptualizing the alignment problem. Before his PhD, he focused on learning human preferences and game-theoretic machine learning under David Duvenaud and Roger Grosse at the Vector Institute and at UC Berkeley. Additionally, he worked as a fellow in the Governance of AI group at Oxford.
-
TBD
-
TBD
Samuel Buteau
Research Scientist, Mila
Samuel Buteau is a Research Scientist at Mila working on the cautious AI scientist agenda with Yoshua Bengio.
-
Samuel Buteau is a Research Scientist at Mila working on the cautious AI scientist agenda with Yoshua Bengio.
-
TBD
-
TBD