I'm interested in mentoring projects related to reward hacking and monitoring (agentic) models that produce long, complex trajectories.
By default, we'll have 30-minute to 1-hour weekly meetings (on Zoom) for high-level guidance. I'm active on Slack and typically respond within a day for quick questions or conceptual (not code) debugging. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping me on Slack.
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort. https://arxiv.org/abs/2510.01367
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases. https://arxiv.org/abs/2510.20270
Language Models Learn to Mislead Humans via RLHF. https://arxiv.org/abs/2409.12822
Weeks 1-2: The mentor will provide high-level directions or problems to work on; the scholar will have the freedom to propose specific projects and discuss them with the mentor.
Week 3: Settle on a detailed project plan.