In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing to unsupervised capability elicitation to robust unlearning. If you're theory-minded, maybe you'll help us formalize shard theory itself.
Discovering qualitatively new techniques
Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequent work has reinforced the promise of this technique, and steering vectors have become a small research subfield in their own right. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack's method for unsupervised steering vector discovery. Gradient routing provides a novel form of weak supervision, enabling us to induce useful structure in models even without access to perfect labels. Unlearn and Distill succeeds at robust unlearning for the same reason that prior deep unlearning methods failed.
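The core activation-addition idea can be sketched in a few lines. This is a toy numpy illustration, not the actual ActAdd implementation: the "layer" is a stand-in random linear map, the contrastive activations are random vectors, and the coefficient is arbitrary. The point is only the mechanics: take the difference of activations on two contrasting inputs, then add a scaled copy of that difference into the residual stream at runtime.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one transformer layer: a fixed random linear map.
W = rng.standard_normal((8, 8))

def layer(x, steering=None):
    """One residual-stream update; optionally add a steering vector."""
    h = x @ W
    if steering is not None:
        h = h + steering
    return h

# Stand-ins for activations on two contrasting prompts
# (e.g. "Love" vs. "Hate" in the GPT-2-XL experiments).
x_pos = rng.standard_normal(8)
x_neg = rng.standard_normal(8)
steering_vec = layer(x_pos) - layer(x_neg)

# At inference time, inject the scaled vector; the coefficient (4.0
# here) is a free knob chosen by hand in practice.
x = rng.standard_normal(8)
steered = layer(x, steering=4.0 * steering_vec)
unsteered = layer(x)

# By construction, the output shifts exactly along the steering direction.
print(np.allclose(steered - unsteered, 4.0 * steering_vec))  # → True
```

In a real model the addition happens inside a chosen layer via a forward hook, and the interesting question is how that linear shift changes downstream sampling behavior; no gradient updates or fine-tuning are involved.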
Formalizing shard theory
Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.
Something else
We're open to supporting any conceptually motivated empirical ML research that promotes the safe development of transformative AI.
We will have weekly 1:1s and a weekly team lunch, along with asynchronous communication over Slack. Mentees are welcome to reach out at any time if they need guidance outside of the usual meeting times.
Scholars should mostly figure things out on their own outside of meetings
Ideal candidates would have:
Fellows will probably work with collaborators from within the stream.
Mentor(s) will talk through project ideas with scholars