ML Alignment Theory Program under Evan Hubinger
For the past six weeks, the Stanford Existential Risks Initiative (SERI) has been running a trial for the “ML Alignment Theory Scholars” (MATS) program. Our goal is to increase the number of people working on alignment theory, and to do this, we’re running a scholars program that provides mentorship, funding, and community to promising new alignment theorists. The program is run in partnership with Evan Hubinger, who has provided all of the mentorship to the scholars during their trial.
As the final phase of the trial, each participant has taken a previous research artifact (usually an Alignment Forum post) and written a distillation and expansion of that post. The posts were picked by Evan, and each participant signed up for one they were interested in. Over the next two weeks (12/7–12/17), we’ll be posting these to LessWrong and the Alignment Forum as part of a sequence, with a couple of posts going up each day. (There will be around 10–15 posts in total.)
Community Engagement
Evan will be evaluating each post to determine whether its author makes it to the next stage of the seminar program (where they have the opportunity to do novel research with a mentor), but we’d also be interested in hearing community feedback on each post, whether through upvotes or comments. We’ll publish a conclusion post with our takeaways a week or two after the final post has been released. If there’s interest, we’d also be happy to write up the story of MATS’ creation for other prospective community builders.
Additionally, if Evan knows you and you would be interested in mentoring one of the participants for the next stage of the program—e.g. you really liked their post and think it would be productive to work with them—you should reach out to Evan.
Program Description
From here on, we discuss the program itself and our future strategy; the sections above provide all the context needed to understand the sequence of posts.
The SERI MATS program aims to gather promising alignment theorists and give them an opportunity to learn from an experienced mentor.
The application process was as follows. We first asked the mentors of the Cambridge AGI Safety Fundamentals program to recommend promising participants, whom we then interviewed. After the interviews, we put the selected scholars through a paid, six-week trial at 10 hours per week. During the trial, scholars worked through Evan’s reading list and distilled and expanded one of the research artifacts that Evan chose. Evan will evaluate each final post to determine who is admitted to the scholars program.
The scholars program will initially run for two months, during which scholars pursue one of the problems on Evan’s list of subproblems; scholars have the option of forming teams. If they produce good research (as evaluated by Evan), we hope to extend the program beyond the initial two months.
Future Steps
In the immediate future, we will continue to refine the program and expand to more mentors beyond Evan. We’ll also likely run another round of the program this spring or summer, though we are not yet accepting applications.
Our long-term goal is to create a self-sustaining community of alignment theory mentors and mentees that builds up new theorists and connects them to mentors, specific subproblems, and (hopefully) research organizations. In other words, we’re trying to ease the transition from finishing a reading group like AGI Safety Fundamentals to working on these problems as a part-time or full-time job.
Acknowledgements
We’re very grateful to be supported by a grant from Open Philanthropy. We’d also like to thank Sawyer from BERI for helping with the financial logistics.
Comments
I personally would be very interested in this, especially with a mind to focusing on prosaic systems alignment today (as opposed to alignment theory).
How do you define alignment theory?
I don’t have a concrete definition off the top of my head, but I can try to give you a sense of what we’re thinking about. “Alignment theory” for us refers to the class of work that reasons about alignment from first principles, rather than running actual experiments. (Happy to discuss why this is our focus if that would be useful.)
Examples: Risks from Learned Optimization, inaccessible information, and most posts in Evan’s list of research artifacts.