The genre of plans that I’d recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that’s selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the “easiest-to-align” condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:
In the context of AI alignment, the “Principle of Minimality” or “Principle of Least Everything” says that when we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it. Similarly, inside the AGI itself, if a class of thought seems dangerous but necessary to execute sometimes, we want to execute the fewest possible instances of that class of thought.
E.g., if we think it’s a dangerous kind of event for the AGI to ask “How can I achieve this end using strategies from across every possible domain?” then we might want a design where most routine operations only search for strategies within a particular domain, and events where the AI searches across all known domains are rarer and visible to the programmers. Processing a goal that can recruit subgoals across every domain would be a dangerous event, albeit a necessary one, and therefore we want to do less of it within the AI (and require positive permission for all such cases and then require operators to validate the results before proceeding).
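To make the gating pattern described in that quote concrete, here is a minimal, purely illustrative sketch. The names (`StrategySearch`, `search_all_domains`, etc.) are hypothetical and mine, not from any real system or from MIRI; the point is just the shape of the design: domain-limited search is the routine path, while cross-domain search is a rarer, logged event that requires positive operator permission and whose results are meant to be validated before anything acts on them.

```python
# Illustrative sketch only; hypothetical names, not any real system's API.
# Pattern: in-domain search is routine; cross-domain search is gated,
# visible in an audit log, and its results go to operators for review.

from dataclasses import dataclass, field


@dataclass
class StrategySearch:
    known_domains: set[str]
    audit_log: list[str] = field(default_factory=list)

    def search_in_domain(self, goal: str, domain: str) -> list[str]:
        """Routine path: only consider strategies within one named domain."""
        if domain not in self.known_domains:
            raise ValueError(f"unknown domain: {domain}")
        self.audit_log.append(f"in-domain search: {goal} ({domain})")
        return [f"{domain}-strategy for {goal}"]  # placeholder result

    def search_all_domains(self, goal: str, operator_approved: bool) -> list[str]:
        """Rare path: cross-domain search requires positive permission.
        Results are returned for operator validation, not executed directly."""
        if not operator_approved:
            raise PermissionError("cross-domain search requires operator approval")
        self.audit_log.append(f"CROSS-DOMAIN search: {goal}")
        return [s for d in self.known_domains
                for s in self.search_in_domain(goal, d)]


# Example usage (hypothetical domains):
planner = StrategySearch(known_domains={"biology", "logistics"})
planner.search_in_domain("ship reagents", "logistics")                 # routine
planner.search_all_domains("ship reagents", operator_approved=True)    # gated, logged
```

Again, this is only a toy rendering of the principle: the design choice being illustrated is that the dangerous class of operation is made rare, explicit, and reviewable, rather than being the default mode of operation.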
Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.
MIRI itself isn’t pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is “make progress on figuring out AI alignment”. So I can separately answer the question “what’s MIRI’s organizational plan for solving alignment?”
My answer to that question is: we don’t currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.
If an alignment idea strikes us as having even a tiny scrap of hope, and isn’t already funding-saturated, then we’re making sure it gets funded. We don’t care whether that happens at MIRI versus elsewhere — we’re just seeking to maximize the amount of good work that’s happening in the world (insofar as money can help with that), and trying to bring about the existence of a research ecosystem that contains a wide variety of different moonshots and speculative ideas that are targeted at the core difficulties of alignment (described in the AGI Ruin and sharp left turn write-ups).
If an idea seems to have a significant amount of hope, and not just a tiny scrap — either at a glance, or after being worked on for a while by others and bearing surprisingly promising fruit — then I expect that MIRI will make that our new organizational focus, go all-in, and pour everything we have into helping with it as much as we can. (E.g., we went all-in on our 2017-2020 research directions, before concluding in late 2020 that these were progressing too slowly to still have significant hope, though they might still meet the “tiny scrap of hope” bar.)
None of the research directions we’re aware of currently meet our “significant amount of hope” bar, but several things meet the “tiny scrap of hope” bar, so we’re continuing to keep an eye out and support others’ work, while not going all-in on any one approach.
Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI’s research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org’s attention on those directions. We just think they’re worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.
Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the “Agent Foundations” problems, and they and other MIRI researchers have continued to do work over the years on problems related to Embedded Agency / Agent Foundations, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees “Embedded Agency” less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)
(Caveat: I had input from some other MIRI staff in writing the above, but I’m speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)
The genre of plans that I’d recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that’s selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.
FYI, I think there’s a huge difference between “I think humanity needs to aim for a pivotal act” and “I recommend that groups pushing the capabilities frontier forward aim for a pivotal act”. I think pivotal acts require massive amounts of good judgment to do right, and, like, I think capabilities researchers have generally demonstrated pretty bad judgment by, um, being capabilities researchers.