I'm interested in mentoring projects related to reward hacking and monitoring (agentic) models that produce long, complex trajectories.
By default, we'll have 30-minute to 1-hour weekly meetings (on Zoom) for high-level guidance. I'm active on Slack and typically respond within a day for quick questions or conceptual (not code) debugging. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping me on Slack.
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort. https://arxiv.org/abs/2510.01367
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases. https://arxiv.org/abs/2510.20270
Language Models Learn to Mislead Humans via RLHF. https://arxiv.org/abs/2409.12822
Weeks 1-2: The mentor will provide high-level directions or problems to work on; the scholar will have the freedom to propose specific projects and discuss them with the mentor.
Week 3: Settle on a detailed project plan.