AI Value Alignment
As artificial intelligence continues to evolve, the integration of advanced AI systems into societal frameworks is becoming increasingly prevalent. Can we develop robust principles that ensure these systems operate in harmony with human values and ethics, preventing adverse outcomes and promoting mutual benefits?
Mentors
Brad Knox
Research Associate Professor, UT Austin
Brad Knox focuses on developing reinforcement learning algorithms that are better aligned with human values and preferences. His research aims to refine how AI systems interpret and integrate human feedback to reflect genuine human interests.
-
Brad is a Research Associate Professor of Computer Science at UT Austin. With a particular focus on humans, he researches how to specify aligned problems for reinforcement learning algorithms (i.e., outer alignment or value alignment). Brad is well known for his early RLHF research on the TAMER framework. His recent work can be found on his website.
-
Much computer science research makes mathematically convenient but unquestioned assumptions about humans, often providing a misaligned foundation; algorithms derived on that foundation result in misaligned AI. In addition to computer science, I draw from multiple fields (e.g., psychology, human-computer interaction, economics), in the hope of creating less naive methods that achieve greater alignment with humans.
My research has two thrusts: (1) deriving and deploying RLHF based upon improved psychological models of how people give preferences and (2) the design of reward functions and other evaluation metrics that are aligned with human preferences over outcomes.
I have a list of specific projects within these thrusts and am also open to being pitched projects that are closely connected to my research area.
Depending on the project, you may use an LLM or VLM as a proxy for humans. For instance, an RLHF project might use actual humans or LLMs as the source of preferences, and we might look at how context during or before preference elicitation affects preferences.
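To make this concrete, here is a minimal sketch (for illustration only, not a project specification) of eliciting pairwise preferences from an LLM standing in for a human annotator, with an optional context string shown before elicitation. The helper `llm_choose` is hypothetical and would be wired to whichever model API a given project actually uses.

```python
# Minimal sketch: eliciting pairwise preferences from an LLM standing in for a
# human annotator, with an optional context block shown before elicitation.
# `llm_choose` is a hypothetical placeholder; connect it to your model API.

def llm_choose(prompt: str) -> str:
    """Placeholder for a single LLM call; should return the model's raw text reply."""
    raise NotImplementedError("Wire this up to the LLM of your choice.")

def elicit_preference(segment_a: str, segment_b: str, context: str = "") -> int:
    """Ask the proxy annotator which of two behavior segments it prefers.
    Returns 0 if A is preferred, 1 if B is preferred."""
    prompt = (
        (context + "\n\n" if context else "")
        + "You are acting as a careful human evaluator.\n"
        + "Which of the following two behaviors better achieves the task?\n\n"
        + f"Option A:\n{segment_a}\n\nOption B:\n{segment_b}\n\n"
        + "Answer with a single letter: A or B."
    )
    reply = llm_choose(prompt).strip().upper()
    return 0 if reply.startswith("A") else 1

def label_pairs(pairs, context: str = ""):
    """Label a list of (segment_a, segment_b) pairs with proxy preferences."""
    return [(a, b, elicit_preference(a, b, context)) for a, b in pairs]
```

Relabeling the same pairs under different `context` strings, or comparing against labels from human annotators, is one simple way to measure how elicitation conditions shift the preference data a reward model would be trained on.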
-
You find compelling the idea of studying the designers, users, or stakeholders of AI to build better learning algorithms and methods for specifying learning problems.
Ideal candidates have:
software engineering experience
strong math and analytical skills
the ability to communicate clearly
a desire to publish an academic paper
expertise in RL
Having previous first-author research publications is strongly recommended. Nice-to-haves include expertise in human subjects study design and/or being located in Austin.
On the softer side, some traits I consider important in research collaborators include strong motivation to deeply understand the problems one works on and a visceral desire to get even the details correct. Also, a sense of how to construct a narrative is useful, in that communicating one's research compellingly resembles creating narratives in more traditional storytelling.
Also, review my recent research in the link in my bio. You should genuinely find it interesting if we're going to collaborate successfully.
Mentorship will look like:
Weekly meetings with me, for 1 hour early on and possibly for 30 minutes once you are showing more autonomy
Frequent communication on Slack, with most responses < 48 hours. You can (and should) ping me if I ever bottleneck your work.
A good amount of background reading to start
I do not typically look at mentees' code but do engage closely with debugging experimental results
Detailed feedback and revisions on your paper draft
Micah Carroll
PhD Student, UC Berkeley CHAI
Micah focuses on AI alignment under changing preferences and values, with interests in how AI influences human behavior and in optimizing AI-human interaction dynamics.
-
Micah Carroll is an AI PhD student at CHAI (UC Berkeley) advised by Stuart Russell and Anca Dragan. Micah's research focuses on AI alignment under changing preferences and values. In the past, he has been interested in recommender systems' incentives to shift user preferences, and in AI manipulation more broadly. For more information, visit his website.
-
Research topics can be flexible depending on interests and fit, but broadly I'm most interested in:
AI alignment with changing and malleable values and preferences. How should we align AI to humans’ changing values and preferences (which can be influenced by AI systems)? How will AI propensities to influence humans – exploiting their feedback (e.g. as with sycophancy) – develop with multi-step RLHF or LLM-agent setups? How well do current LLMs know what kind of influences towards humans are beneficial and harmful? How can we best leverage their knowledge in this domain? Can we use them to better elicit meta-preferences?
Reward hacking of LM reward models for sequential decision-making RL agents. Recent work has used VLMs and LLMs as reward models for training RL agents in sequential decision-making environments. Somewhat surprisingly, studies so far have not found much evidence of reward hacking in these settings. While these models definitely can be reward hacked, they may be more robust to overoptimization than other training setups – it would be interesting to understand why. How does susceptibility to reward hacking depend on 1) the size of the reward model, 2) the size of the action space of the agent, and potentially 3) other properties of the decision-making environment?
Learning models of human feedback suboptimality, and extrapolating beyond them. A challenge affecting many AI alignment schemes is developing accurate models of human suboptimality, which would enable correct extrapolation from the feedback we provide the AI to what we actually want. While the task of jointly learning models of human suboptimality and of what humans value is difficult, can we leverage the priors contained in LLMs? More generally, are there ways we can better disambiguate human biases, human values, and changes in values (or biases)?
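To make the last topic concrete, here is a toy sketch (for illustration only, not a prescribed approach). Under the common Boltzmann-rational assumption, the probability that a labeler prefers segment A over B is P(A preferred) = σ(β · (r(A) − r(B))), where β captures how noisy or suboptimal the feedback is. The function below, with assumed inputs (feature arrays and binary preference labels), jointly fits linear reward weights and β by gradient ascent on the preference log-likelihood.

```python
# Toy sketch: jointly fitting a linear reward model and a Boltzmann rationality
# coefficient (beta) from pairwise preference labels, via gradient ascent on the
# log-likelihood. Feature arrays and labels are assumed to be given.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_preference_model(phi_a, phi_b, prefs, steps=2000, lr=0.05):
    """phi_a, phi_b: (N, d) reward-feature arrays for the two segments in each pair.
    prefs: (N,) array, 1 if segment A was preferred, else 0.
    Returns (w, beta): linear reward weights and the fitted rationality coefficient."""
    n, d = phi_a.shape
    w = np.zeros(d)
    log_beta = 0.0                          # optimize log(beta) so beta stays positive
    for _ in range(steps):
        beta = np.exp(log_beta)
        diff = (phi_a - phi_b) @ w          # r(A) - r(B) under the current weights
        p_a = sigmoid(beta * diff)          # P(A preferred | w, beta)
        err = prefs - p_a                   # gradient of log-likelihood wrt the logits
        w += lr * beta * (err @ (phi_a - phi_b)) / n
        log_beta += lr * beta * np.mean(err * diff)
    return w, np.exp(log_beta)
```

Note that β and the scale of the reward weights are only separately identifiable if one of them is constrained (e.g., by normalizing w); that ambiguity is one concrete face of the broader question of disambiguating human suboptimality from human values.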
-
I'd be particularly excited by candidates who have one or more of the following:
A strong software engineering background, and at the very least conceptual familiarity with machine learning’s key ideas
Previous research experience, ideally in machine learning and with LLMs
Experience with running human subjects experiments may also be helpful, depending on the project
My mentorship style is quite flexible, but I would likely want to meet at least once a week (and happy to do more if the pace of progress warrants it). My Slack response time is typically <24h.
Matija Franklin
Researcher, AI Objectives Institute
Matija focuses on developing nuanced methods for eliciting human feedback to train better reward models for AI alignment. He contributed to the EU AI Act and is exploring standardized non-financial disclosure frameworks to improve corporate AI practices.
-
Matija Franklin is a researcher in AI Alignment and Policy who has recently completed a PhD at UCL. He conducts interdisciplinary research and has over 30 academic publications. Last year, Matija worked on manipulation evals, contributing to the open-source dangerous capabilities evals project at OpenAI, as well as working as a contractor at Google DeepMind and as an advisor to the UK AI Safety Institute. His policy work on manipulation and general-purpose AI systems has been adopted by the EU AI Act. Currently, he focuses on developing better methods for eliciting human feedback for the purpose of training better reward models for alignment. He is also researching how a standardised non-financial disclosure framework could promote better corporate practice around the development and use of AI.
-
AI Risk, as understood by AI alignment and ethics researchers, is present in the economy but often not measured. As it remains unidentified, AI risk isn’t “priced in” and thus does not influence the allocation of capital. If it were, it would radically change the incentive landscape of AI development and deployment. Ongoing efforts at the UN and among financial regulators seek to mandate this in the near future. Can we develop a standardised measure of AI Risk and leverage existing governance institutions to build an international consensus around it?
-
Ideal candidates would possess one or more of the following:
Strong understanding of AI Risk, both technical and socio-technical
Actuarial, insurance, or risk management experience
Strong understanding of economics, particularly microeconomics/financial economics
Corporate procurement or asset management experience
Management consulting / MBA
Financial regulation / central banking
The project requires:
Inclination towards interdisciplinarity
Being able to select and apply different scientific paradigms to solve complex problems (e.g., ML and Cog Sci / Business and Political Strategy)
Ability to switch between systems thinking and analytic thinking
Conscientiousness, being a self-starter, and the ability to identify problems
Mentorship for scholars will likely be:
1-hour weekly meetings with each scholar individually
Team meetings every 4 weeks
Slack response time typically ≤ 48 hours
Philip Moreira Tomei
Researcher, AI Objectives Institute
Philip specializes in applying cognitive science to complexity theory and leads research on AI risk in capital markets at the AI Objectives Institute. His current projects include shaping AI governance through the All-parliamentary policy group at Chatham House and developing standardized measures of AI risk to influence global financial systems.
-
Philip Moreira Tomei has a background in Cognitive Science with a specialisation in its applications to Complexity theory. He currently leads research on AI risk and capital markets at the AI Objectives Institute and is an incoming fellow at Antikythera. Philip is building the All-parliamentary policy group for AI Governance at Chatham House, the Royal Institute of International Affairs. He previously worked with The Future Society on the EU AI Act and published research on its implications for AI Persuasion and Manipulation with the OECD. His work on AI Governance has been published at NeurIPS with Pax Machina Research.
-
AI Risk, as understood by AI alignment and ethics researchers, is present in the economy but often not measured. As it remains unidentified, AI risk isn’t “priced in” and thus does not influence the allocation of capital. If it were, it would radically change the incentive landscape of AI development and deployment. Ongoing efforts at the UN and among financial regulators seek to mandate this in the near future. Can we develop a standardised measure of AI Risk and leverage existing governance institutions to build an international consensus around it?
-
Ideal candidates would possess one or more of the following:
Strong understanding of AI Risk, both technical and socio-technical
Actuarial, insurance, or risk management experience
Strong understanding of economics, particularly microeconomics/financial economics
Corporate procurement or asset management experience
Management consulting / MBA
Financial regulation / central banking
The project requires:
Inclination towards interdisciplinarity
Being able to select and apply different scientific paradigms to solve complex problems (e.g., ML and Cog Sci / Business and Political Strategy)
Ability to switch between systems thinking and analytic thinking
Conscientiousness, being a self-starter, and the ability to identify problems
Mentorship for scholars will likely be:
1-hour weekly meetings with each scholar individually
Team meetings every 4 weeks
Slack response time typically ≤ 48 hours
Tsvi Benson-Tilsen
Researcher, MIRI
Tsvi’s research focuses on understanding core concepts of agency, mind, and goal-pursuit in order to innovate in AI intent alignment, combining philosophical rigor with the practical constraints and desiderata essential for developing corrigible strong minds.
-
Tsvi Benson-Tilsen works on the foundations of rational agency, including logical uncertainty, logical counterfactuals, and reflectively stable decision making, as well as other questions of AI alignment. Before joining MIRI as a full-time researcher, he collaborated on “Logical Induction”. Tsvi holds a BSc in Mathematics with honors from the University of Chicago, and is on leave from the UC Berkeley Group in Logic and the Methodology of Science PhD program. Tsvi joined MIRI in June 2017.
-
The project is to apply speculative analytic philosophy to core concepts about agency, mind, and goal-pursuit, paving the way to address the hard problem of AGI intent alignment. We'll bring in criteria (constraints and desiderata) from the nature of agency and from the engineering goal of creating a corrigible strong mind. Example constraint: by default, if a mind is highly capable then it also quickly increases its capabilities. Example desideratum: for some mental element to determine a mind's ultimate effects, it has to be stable under pressures from reflective self-modification; so we'd like a concept of core effect-determiners that describes mental elements which can be stable. We'll look at the demands that these criteria make on our concepts, and find better concepts. See "A hermeneutic net for agency."
-
Software engineering isn't very relevant. Knowledge about machine learning isn't very relevant. A strong math background isn't directly needed but would be helpful for the thought patterns. Math content knowledge that's somewhat relevant: mainly classical agent foundations topics (logic, probability, games, decision theory, algorithmic complexity).
The project will be centered around doing standard analytic philosophy, but with much more impatience to get to the core of things, and with more willingness to radically deconstruct preconceived ideas to set the stage for creating righter concepts. So a prerequisite is to have already been seriously struggling with philosophical questions around mind / agency / language / ontology / value. Having struggled with a bit of Quine, Wittgenstein, Anscombe, Fodor, Deacon, Heidegger, Bergson, Lakoff, etc. is some positive indicator.
You should be able and happy to "buy in to" abstract arguments, while helping keep them grounded in concrete examples and desiderata. E.g. "concepts are grounded in counterfactuals and counterfactuals are grounded in one's own possible actions" should seem relevant, like the sort of thing you might have or form opinions about (after some clarification).
Understanding of "the MIRI view" on AGI X-risk is helpful, and practically speaking, some significant overlap with that view is probably needed to hit the ground running--to be aimed at the hard problems of intent alignment. Indeed: Applicants must be interested in the hard problem of AGI intent alignment.
Applicants must know and/or be willing to learn from me that unexamined deference will render them unable to make any key progress.