He He

I'm interested in mentoring projects related to reward hacking and monitoring (agentic) models that produce long and complex trajectories. Scholars will have the freedom to propose projects within this scope. Expect 30-60 minutes of 1-1 time on Zoom.

Stream overview

I'm interested in mentoring projects related to reward hacking and monitoring (agentic) models that produce long and complex trajectories.

Mentors

He He
Associate Professor, New York University
New York City
Monitoring
Dangerous Capability Evals
Scalable Oversight
Safeguards

He He is an associate professor at New York University. She is interested in how large language models work and in the potential risks of this technology.

Mentorship style

Weekly meetings of 30 minutes to 1 hour (on Zoom) by default for high-level guidance. I'm active on Slack and typically respond within a day to quick questions or conceptual (not code) debugging. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping me on Slack.

Representative papers

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort. https://arxiv.org/abs/2510.01367

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases. https://arxiv.org/abs/2510.20270

Language Models Learn to Mislead Humans via RLHF. https://arxiv.org/abs/2409.12822

Scholars we are looking for

  • Strong engineering skills and experience in training deep learning models
  • Familiarity with modern large-scale RL pipelines (e.g., using frameworks such as verl)

Project selection

Weeks 1-2: The mentor will provide high-level directions or problems to work on, and the scholar will have the freedom to propose specific projects and discuss them with the mentor.

Week 3: Work out a detailed plan for the project.