AI Agency

AI Agency covers the study of agency in AI systems: modelling optimal agents, how those agents interact with each other, and how agents can be aligned with one another.

Mentors

Andi Peng
Research Scientist, Anthropic

Andi Peng leads national security safety evaluations at Anthropic and is a PhD student at MIT.

  • Andi Peng leads national security safety evaluations at Anthropic and is a PhD student at MIT. Her research has primarily focused on learning good representations and preference models for RL agents and robots.

  • The mentors for this stream will be Andreea Bobu and Andi Peng, with Andreea Bobu being the primary mentor.

    We are broadly interested in AI agents learning to do tasks for, with, and around humans. Our main research motivation is ensuring that these agents are value aligned with the humans they are meant to support, whether the human is an expert designer, a novice end user, or a stakeholder of the AI system. The work that we do involves reward learning, learning from (potentially multiple) kinds of human feedback, active learning, representation learning, and quantifying misalignment.

  • We are broadly interested in AI agents learning to do tasks for, with, and around humans. Our main research motivation is ensuring that these agents are value aligned with the humans they are meant to support, whether the human is an expert designer, a novice end user, or a stakeholder of the AI system. This is challenging for many reasons. Values are notoriously difficult to specify to our AI agents. Not only that, but learning values (for instance, as reward functions) — which is the current popular alternative to specifying them by hand — is also difficult: 1) getting the right data to supervise the learning (via RLHF-style methods) is nontrivial because humans are imperfect, not infinitely queryable, and have unique and changing preferences and values; 2) the representations we choose to mathematically express values may themselves be misaligned, thus preventing us from ever being able to capture the “true” desired values; 3) reliably quantifying misalignment to be able to robustly tell when the AI system is safe for operation is still underexplored. To tackle these challenges, in our research we bring cross-disciplinary expertise from deep learning, human-robot interaction, mathematical human modeling, cognitive psychology, and probabilistic uncertainty, with the hope of creating more aligned, generalizable, and robust learning algorithms and AI systems.

    With this context in mind, research topics are flexible depending on interests and fit, but we are broadly interested in the following three thrusts of work (which you can also read more about on our lab website):

    • Getting (enough of) the right data to align our AI agents’ values with humans. Typical RLHF methods treat humans as infinitely queryable oracles. While we do have large amounts of data on the internet, each individual human the AI agent will interact with will have unique and changing preferences, values, and biases that canned internet data alone may not reflect. Since we want AI agents to quickly adapt and align their existing values with each human user, how can we relax the infinitely queryable oracle assumption?

    • How do we create data-efficient algorithms that ask for the right input (maximize information gain) while not over-burdening the human (minimize human effort)? Can we make use of active learning techniques that combine these objectives in order to minimize the number of queries required from the human? (A minimal sketch of such a query-selection objective follows this list.)

    • What kind of input do we want to ask humans for? Preference queries are low effort but contain relatively little information for the agent. Could we enhance the information gain by appending targeted explanations to the preferences (e.g., “I prefer A to B because A has less political content”)? Could we combine different feedback modalities (“showing” inputs like examples of correct behavior vs. “telling” inputs like language corrections) for increased agent alignment?

    • How do we make use of large pre-trained models, while still adapting to the specific human's needs? LLMs contain very powerful human priors that we have found to substantially reduce human burden when learning how to execute tasks from them. Can we make use of these LLM priors to amplify the data we receive from humans? Can we prompt LLMs to think about the reason underlying a human input or decision, and then use that causal reason to generate new situations where the human would likely behave in a similar way? Can we use LLMs to elicit preferences from humans?

    • Aligning agent representations with humans for increased downstream value alignment. We show in recent work that if the representations we choose to express values are misaligned with those of humans, the downstream learned values will also be inevitably misaligned. Misaligned representations lead to unintended, unexpected, and even catastrophic behavior. How can we learn agent representations of the world that are aligned and in agreement with those of the humans they are cooperating with? Our previous efforts have looked at methods that learn representations one concept at a time vs all-at-once, but each approach has important tradeoffs: one concept at a time leads to more interpretable and structure-inducing representations, but they require the human to think of and explicate every dimension of their representation; all-at-once is more scalable and effortless but results in more entangled and less interpretable representations. Can we somehow get the best of both worlds: can we obtain more interpretable and disentangled representations while making it easier for the human to teach the AI agent about their representation? Moreover, if these representations are indeed interpretable, can we use them to facilitate smoother cooperation and communication between humans and their AI agent?

    • Reliably quantifying misalignment. In order to even know when to stop current (potentially unsafe) behavior and initiate the process of alignment with the human, the AI system needs to be able to quantify misalignment. Our past work has looked at quantifying misalignment on small Bayesian models, but there is little principled work on quantifying misalignment on large models like LLMs. How can we quantify and detect misalignment by monitoring inconsistencies between the human’s feedback and the AI agent’s behavior? How can AI agents distinguish between inconsistencies due to incorrect models and those due to noisy human judgements? How should AI agents resolve that misalignment once detected? Should AI systems always assume that the human is right and thus adjust their model, or should there be a debate process where the human and the agent arrive at a shared model of the world together?
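
    The sketch below illustrates the query-selection objective referenced above: choose the preference query that maximizes expected information gain about a distribution over reward hypotheses while penalizing a proxy for human effort. It is a minimal illustration under assumed models (linear rewards, Boltzmann-rational choices, a particle posterior), not code from the mentors' lab; names such as `expected_info_gain` and `select_query` are hypothetical.

    ```python
    # Minimal sketch (hypothetical, not the mentors' code): pick the preference query
    # that maximizes expected information gain about a posterior over reward
    # hypotheses while penalizing a proxy for human effort.
    import numpy as np

    rng = np.random.default_rng(0)
    D = 4                                    # reward feature dimension
    hypotheses = rng.normal(size=(100, D))   # candidate linear reward weights
    posterior = np.ones(100) / 100           # particle posterior over hypotheses
    beta = 2.0                               # assumed human rationality coefficient

    def p_choose_a(query, w):
        """P(human prefers option A over B) under a Boltzmann-rational model."""
        phi_a, phi_b = query
        return 1.0 / (1.0 + np.exp(-beta * w @ (phi_a - phi_b)))

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def expected_info_gain(query):
        """Expected reduction in posterior entropy from asking this query."""
        p_a = np.array([p_choose_a(query, w) for w in hypotheses])
        gain, h_prior = 0.0, entropy(posterior)
        for answer in (p_a, 1.0 - p_a):            # human answers "A" or "B"
            marginal = float(np.sum(posterior * answer))
            post = posterior * answer / marginal
            gain += marginal * (h_prior - entropy(post))
        return gain

    def select_query(candidates, effort, lam=0.1):
        """Trade off information gain against a (hypothetical) effort penalty."""
        scores = [expected_info_gain(q) - lam * effort(q) for q in candidates]
        return candidates[int(np.argmax(scores))]

    # Toy usage: queries are pairs of trajectory feature vectors; effort is proxied
    # by how similar the two options are (similar options are harder to compare).
    candidates = [(rng.normal(size=D), rng.normal(size=D)) for _ in range(50)]
    effort = lambda q: 1.0 / (1e-3 + np.linalg.norm(q[0] - q[1]))
    print(select_query(candidates, effort))
    ```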

  • The best candidates are those who are passionate about exploring how thinking about the human element (designers, end users, or stakeholders of AI) in the human-AI interaction equation can lead to building more aligned, generalizable, and robust learning algorithms and AI systems.

    In terms of hard skills, ideal candidates should have one or more of the following:

    • Previous research experience in machine learning (or related fields), ideally with published work.

    • Strong software engineering background and familiarity with machine learning tools and techniques.

    • Experience with running human subject experiments and statistical analyses is a plus.

    As for soft skills:

    • The ability to communicate clearly.

    • Strong intrinsic motivation and curiosity for learning.

    • A desire to publish an academic paper.

    • A sense for how to build a story, since doing good, impactful research also involves communicating that research in a way that draws in and compels the community.

Michael Cohen
Postdoc, UC Berkeley CHAI

I mainly work on developing solutions-in-theory to the control problem.

  • I mainly work on developing solutions-in-theory to the control problem. What I mean by a solution-in-theory is:

    1. Could do superhuman long-term planning

    2. Ongoing receptiveness to feedback about its objectives

    3. No reason to escape human control to accomplish its objectives

    4. No impossible demands on human designers/operators

    5. No TODOs when defining how we set up the AI’s setting

    6. No TODOs when defining any programs that are involved, except how to modify them to be tractable

    You can see introductions to my work on this topic at michael-k-cohen.com/blog.

  • How can we solve the control problem in theory, with fewer drawbacks than existing solutions-in-theory? How can we turn existing solutions-in-theory into solutions-in-practice? See michael-k-cohen.com/blog.

  • I would like to supervise research into how to avoid, in theory or in practice, any of the assumptions in this paper: https://onlinelibrary.wiley.com/doi/10.1002/aaai.12064. I'd like to support research in the "eliciting latent knowledge" agenda. I'm happy to support any interesting alignment research.

  • Candidates should have some research experience.

Max Kleiman-Weiner
Assistant Professor, University of Washington

Max Kleiman-Weiner is an Assistant Professor at the University of Washington. His research focuses on modeling human social and moral intelligence and building aligned and cooperative artificial intelligence.

  • Max Kleiman-Weiner is an Assistant Professor at the University of Washington, where he is the PI of the Computational Minds and Machines lab. His research focuses on modeling human social and moral intelligence and building aligned and cooperative artificial intelligence. He completed a PhD in Computational Cognitive Science at MIT, advised by Josh Tenenbaum, where he was an NSF and Hertz Foundation Fellow. His thesis on the computational principles of human social intelligence won the Robert J. Glushko Prize for Outstanding Doctoral Dissertation in Cognitive Science.

  • I'm interested in research questions at the intersection of AI and game theory, using tools from (deep) reinforcement learning or LLM-based models of agency. Questions that I'm excited about include: understanding how and when cooperation emerges from self-interested interactions, developing multi-agent formalisms of strategic and situational power, and learning how best to optimize the autonomy (rather than the utility) of other agents.
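
    As one concrete anchor for the autonomy question, the empowerment literature offers a standard formalization of an agent's capacity to influence its future; the definition below is a common one from that literature, included here as an illustrative assumption rather than the mentor's own formalism. "Optimizing for autonomy" could then mean having the AI act so as to maximize the human's empowerment rather than the human's reward.

    ```latex
    % Empowerment of an agent in state s over an n-step horizon: the channel capacity
    % from its next n actions to the resulting state (a standard definition from the
    % empowerment literature; an illustrative assumption, not the mentor's notation).
    \[
      \mathfrak{E}_n(s) \;=\; \max_{p(a_{1:n})} \, I\!\left(A_{1:n};\, S_{t+n} \mid S_t = s\right)
    \]
    ```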

  • Here are three "seeds" for possible projects that I believe we can jointly develop into a successful research program.

    1. Study the conditions for the emergence of cooperation between self-interested LLM-based agents, where each agent produces short programs that define its behavior and conditions its output on both its partner's code and the success or failure of its own prior program. Compare LLM-based agents to some of the results developed in "open-source" game theory (a toy example follows this list).

    2. Extend recent models of power to multi-agent environments. Can we predict when a particular agent will be more powerful than another in a given environment or infer when one agent is exercising power over another?

    3. How can we incentivize AI agents to optimize for our autonomy rather than just human reward? One can imagine that, in the limit of AI capabilities, even a helpful AI system would solve all of our challenges for us rather than allow us to learn through exploration and trial and error. Can we formalize autonomy in such a way that an AI agent will expand human agency rather than replace it? This will involve thinking about models of empowerment and AI-human interaction.
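
    As a toy illustration of the first seed, here is a miniature "open-source" prisoner's dilemma in which each strategy is a short program that can read its partner's source code before acting. The specific bots (a defector and a "clique" bot that cooperates only with an exact copy of itself) are standard illustrative examples from the open-source game theory literature, not the proposed project; LLM-based agents would take the place of these hand-written programs.

    ```python
    # Toy "open-source" prisoner's dilemma (illustrative, not the project itself):
    # each strategy is a short program that sees its partner's source before acting.
    import inspect

    COOPERATE, DEFECT = "C", "D"

    def defect_bot(partner_source: str) -> str:
        return DEFECT

    def clique_bot(partner_source: str) -> str:
        # Cooperate only with a program whose source is identical to my own.
        return COOPERATE if partner_source == inspect.getsource(clique_bot) else DEFECT

    PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
               ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

    def play(agent_a, agent_b):
        a = agent_a(inspect.getsource(agent_b))
        b = agent_b(inspect.getsource(agent_a))
        return PAYOFFS[(a, b)]

    print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
    print(play(clique_bot, defect_bot))  # (1, 1): clique_bot defends itself
    ```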

  • All of my work is computational and empirical. This requires a significant degree of maturity in terms of software engineering and knowledge of the relevant machine-learning techniques. Familiarity with game-theoretic concepts such as "best response" and "equilibrium" is a major plus, as is at least an interest in learning these ideas. Students who are interested in LLM-based agents should have at least some experience running experiments with open-source LLMs.

Alex Cloud
Independent Researcher

Alex is developing gradient routing, a novel method for training selective modularity into neural networks.

  • Alex Cloud is an independent researcher developing gradient routing, a novel method for training selective modularity into neural networks that has potential applications in interpretability, robust unlearning, and scalable oversight. Previously, he conducted applied research in reinforcement learning at Riot Games AI and Amazon. He holds a PhD in Statistics from North Carolina State University, where he was advised by Eric Laber.

    Alex Cloud was a core member of the team that developed gradient routing during Alex Turner's Summer 2024 MATS cohort. Now, he is excited to support Alex Turner as a co-mentor for the upcoming stream.
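
    The following is a minimal PyTorch sketch of the gradient-masking idea at the core of gradient routing as described above, not the authors' implementation: the forward pass is unchanged, but a data-dependent mask determines which hidden units a given batch is allowed to update, so that designated data trains only a designated region of the network. The `route` helper and the specific masks are illustrative assumptions.

    ```python
    # Minimal sketch of the gradient-routing idea (illustrative, not the authors' code):
    # forward values are unchanged, but gradients only flow through masked-in units.
    import torch
    import torch.nn as nn

    def route(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """Identity in the forward pass; gradients pass only where mask == 1."""
        return mask * x + (1.0 - mask) * x.detach()

    hidden = 16
    model = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward_routed(x, mask):
        h = model[1](model[0](x))
        h = route(h, mask)   # restrict which hidden units this batch can train
        return model[2](h)

    # Hypothetical setup: "forget" data may only update the first half of the hidden
    # layer, so that region can later be ablated without disturbing the rest.
    mask_forget = torch.zeros(hidden)
    mask_forget[: hidden // 2] = 1.0
    mask_retain = torch.ones(hidden)   # ordinary data is unrestricted

    x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
    loss = nn.functional.cross_entropy(forward_routed(x, mask_forget), y)
    loss.backward()
    # Only first-layer weights feeding the masked-in units receive gradient here;
    # layers downstream of the routing point are updated as usual.
    ```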

  • This stream's mentors will be Alex Turner and Alex Cloud, with Alex Turner being the primary mentor.

    In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing to unsupervised capability elicitation. If you're theory-minded, maybe you'll help us formalize shard theory itself.

  • Discovering qualitatively new techniques

    Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequent work has reinforced the promise of this technique, and steering vectors have become a small research subfield of their own. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. Gradient routing (forthcoming) potentially unlocks the ability to isolate undesired circuits to known parts of the network, after which point they can be ablated or studied.
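
    For readers new to the area, here is a minimal sketch of the activation-addition idea behind steering vectors, assuming the Hugging Face transformers library and GPT-2; the choice of layer, prompts, and coefficient is arbitrary and illustrative rather than taken from the original work.

    ```python
    # Minimal sketch of runtime activation steering ("activation addition"); the layer,
    # prompts, and coefficient below are illustrative choices, not the original setup.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    layer, coeff = 6, 4.0

    def mean_activation(prompt: str) -> torch.Tensor:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        return hs[layer + 1].mean(dim=1)   # mean over token positions at the chosen block

    # Steering vector: difference of activations for two contrasting prompts.
    steer = mean_activation(" Love") - mean_activation(" Hate")

    def add_steering(module, inputs, output):
        # Add the steering vector to the block's hidden states at every position.
        return (output[0] + coeff * steer,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(add_steering)
    ids = tok("I think dogs are", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0]))
    handle.remove()
    ```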

    What other subfields can we find together?

    Formalizing shard theory

    Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.

  • Ideal candidates would have:

    • Academic background in machine learning, computer science, statistics, or a related quantitative field.

    • Familiarity with ML engineering.

    • Proven experience working on machine learning projects, either academically or professionally.

    • Strong programming skills, preferably in Python, and proficiency in data manipulation and analysis.

    • Ability to write up results into a paper.

    Mentorship looks like:

    • Weekly meetings (1-on-1 and group)

    • Slack presence

Andreea Bobu
Assistant Professor, MIT CSAIL

Andreea’s work looks at 1) getting the right data to supervise agents, whether directly from people or via priors; 2) enabling humans and robots to efficiently and interactively arrive at shared task representations; 3) quantifying and addressing misalignment caused by different human modeling choices.

  • Andreea Bobu is an Assistant Professor at MIT in AeroAstro and CSAIL. She leads the Collaborative Learning and Autonomy Research Lab (CLEAR Lab), where they develop autonomous agents that learn to do tasks for, with, and around people. Her goal is to ensure that these agents' behavior is aligned with human expectations, whether they interact with expert designers or novice users. She obtained her Ph.D. in Electrical Engineering and Computer Science at UC Berkeley with Anca Dragan in 2023. Prior to her Ph.D., she earned her Bachelor’s degree in Computer Science and Engineering from MIT in 2017. She was the recipient of the Apple AI/ML Ph.D. fellowship, is a Rising Star in EECS and an R:SS and HRI Pioneer, and has won the Best Paper Award at HRI 2020 and the Emerging Research Award at the International Symposium on the Mathematics of Neuroscience 2023. Before MIT, she was also a Research Scientist at the AI Institute and an intern at NVIDIA in the Robotics Lab.

  • The mentors for this stream will be Andreea Bobu and Andi Peng, with Andreea Bobu being the primary mentor.

    We are broadly interested in AI agents learning to do tasks for, with, and around humans. Our main research motivation is ensuring that these agents are value aligned with the humans they are meant to support, whether the human is an expert designer, a novice end user, or a stakeholder of the AI system. The work that we do involves reward learning, learning from (potentially multiple) kinds of human feedback, active learning, representation learning, and quantifying misalignment.

  • We are broadly interested in AI agents learning to do tasks for, with, and around humans. Our main research motivation is ensuring that these agents are value aligned with the humans they are meant to support, whether the human is an expert designer, a novice end user, or a stakeholder of the AI system. This is challenging for many reasons. Values are notoriously difficult to specify to our AI agents. Not only that, but learning values (for instance, as reward functions) — which is the current popular alternative to specifying them by hand — is also difficult: 1) getting the right data to supervise the learning (via RLHF-style methods) is nontrivial because humans are imperfect, not infinitely queryable, and have unique and changing preferences and values; 2) the representations we choose to mathematically express values may themselves be misaligned, thus preventing us from ever being able to capture the “true” desired values; 3) reliably quantifying misalignment to be able to robustly tell when the AI system is safe for operation is still underexplored. To tackle these challenges, in our research we bring cross-disciplinary expertise from deep learning, human-robot interaction, mathematical human modeling, cognitive psychology, and probabilistic uncertainty, with the hope of creating more aligned, generalizable, and robust learning algorithms and AI systems.

    With this context in mind, research topics are flexible depending on interests and fit, but we are broadly interested in the following three thrusts of work (which you can also read more about on our lab website):

    • Getting (enough of) the right data to align our AI agents’ values with humans. Typical RLHF methods treat humans as infinitely queryable oracles. While we do have large amounts of data on the internet, each individual human the AI agent will interact with will have unique and changing preferences, values, and biases that canned internet data alone may not reflect. Since we want AI agents to quickly adapt and align their existing values with each human user, how can we relax the infinitely queryable oracle assumption?

    • How do we create data-efficient algorithms that ask for the right input (maximize information gain) while not over-burdening the human (minimize human effort)? Can we make use of active learning techniques that combine these objectives in order to minimize the number of queries required from the human?

    • What kind of input do we want to ask humans for? Preference queries are low effort but contain relatively little information for the agent. Could we enhance the information gain by appending targeted explanations to the preferences (e.g., “I prefer A to B because A has less political content”)? Could we combine different feedback modalities (“showing” inputs like examples of correct behavior vs. “telling” inputs like language corrections) for increased agent alignment?

    • How do we make use of large pre-trained models, while still adapting to the specific human's needs? LLMs contain very powerful human priors that we have found to substantially reduce human burden when learning how to execute tasks from them. Can we make use of these LLM priors to amplify the data we receive from humans? Can we prompt LLMs to think about the reason underlying a human input or decision, and then use that causal reason to generate new situations where the human would likely behave in a similar way? Can we use LLMs to elicit preferences from humans?

    • Aligning agent representations with humans for increased downstream value alignment. We show in recent work that if the representations we choose to express values are misaligned with those of humans, the downstream learned values will also be inevitably misaligned. Misaligned representations lead to unintended, unexpected, and even catastrophic behavior. How can we learn agent representations of the world that are aligned and in agreement with those of the humans they are cooperating with? Our previous efforts have looked at methods that learn representations one concept at a time vs all-at-once, but each approach has important tradeoffs: one concept at a time leads to more interpretable and structure-inducing representations, but they require the human to think of and explicate every dimension of their representation; all-at-once is more scalable and effortless but results in more entangled and less interpretable representations. Can we somehow get the best of both worlds: can we obtain more interpretable and disentangled representations while making it easier for the human to teach the AI agent about their representation? Moreover, if these representations are indeed interpretable, can we use them to facilitate smoother cooperation and communication between humans and their AI agent?

    • Reliably quantifying misalignment. In order to even know when to stop current (potentially unsafe) behavior and initiate the process of alignment with the human, the AI system needs to be able to quantify misalignment. Our past work has looked at quantifying misalignment on small Bayesian models, but there is little principled work on quantifying misalignment on large models like LLMs. How can we quantify and detect misalignment by monitoring inconsistencies between the human’s feedback and the AI agent’s behavior? How can AI agents distinguish between inconsistencies due to incorrect models and those due to noisy human judgements? How should AI agents resolve that misalignment once detected? Should AI systems always assume that the human is right and thus adjust their model, or should there be a debate process where the human and the agent arrive at a shared model of the world together? (An illustrative sketch of misalignment monitoring follows this list.)
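
    To make the last question slightly more concrete, here is an illustrative sketch (not the mentors' method) of one way to monitor misalignment: track how well the agent's learned reward model explains recent human feedback, and flag a problem when the average surprisal rises. Linear rewards, Boltzmann-rational feedback, and the thresholding logic are all assumptions for the example.

    ```python
    # Illustrative sketch (not the mentors' method): flag potential misalignment when
    # the agent's learned reward model stops explaining the human's feedback well.
    import numpy as np

    beta = 2.0   # assumed human rationality coefficient

    def pref_likelihood(w, phi_chosen, phi_rejected):
        """P(human prefers `chosen` over `rejected`) under the learned reward w."""
        return 1.0 / (1.0 + np.exp(-beta * w @ (phi_chosen - phi_rejected)))

    def misalignment_score(w, feedback, window=20):
        """Average surprisal of recent feedback under the agent's reward model.
        A persistently high value suggests the model (or its feature representation)
        is misaligned with the human, and alignment should be re-initiated."""
        nll = [-np.log(pref_likelihood(w, a, b) + 1e-12) for a, b in feedback[-window:]]
        return float(np.mean(nll))

    rng = np.random.default_rng(1)
    w_agent = np.array([1.0, 0.0, 0.0])   # agent's learned reward weights
    w_human = np.array([1.0, 0.0, 2.0])   # the human also cares about a third feature

    feedback = []
    for _ in range(100):
        a, b = rng.normal(size=3), rng.normal(size=3)
        # Simulated human prefers whichever option scores higher under *their* reward.
        chosen, rejected = (a, b) if w_human @ a > w_human @ b else (b, a)
        feedback.append((chosen, rejected))

    print(f"misalignment score: {misalignment_score(w_agent, feedback):.2f}")
    # Compare against a score calibrated on feedback the model explains well; a large
    # gap is a cue to pause and re-align rather than keep optimizing.
    ```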

  • The best candidates are those who are passionate about exploring how thinking about the human element (designers, end users, or stakeholders of AI) in the human-AI interaction equation can lead to building more aligned, generalizable, and robust learning algorithms and AI systems.

    In terms of hard skills, ideal candidates should have one or more of the following:

    • Previous research experience in machine learning (or related fields), ideally with published work.

    • Strong software engineering background and familiarity with machine learning tools and techniques.

    • Experience with running human subject experiments and statistical analyses is a plus.

    As for soft skills:

    • The ability to communicate clearly.

    • Strong intrinsic motivation and curiosity for learning.

    • A desire to publish an academic paper.

    • A sense for how to build a story, since doing good, impactful research also involves communicating that research in a way that draws in and compels the community.

Alex Turner
Research Scientist, Google DeepMind

Alex is conducting research on "steering vectors" to manipulate and control language models at runtime, alongside exploring mechanistic interpretability of AI behaviors in maze-solving contexts and the theory of value formation in AI.