My sense is that Eliezer and Nate (and I think some other researchers) updated towards shorter timelines in late 2016 / early 2017 (“moderately higher probability to AGI’s being developed before 2035” in our 2017 update). This then caused them to think AF was less promising than they’d previously thought, because it would be hard to solve AF by 2035.
On their model, as I understand it, it was good for AF research to continue (among other things, because there was still a lot of probability mass on ‘AGI is more than 20 years away’), and marginal AF progress was still likely to be useful even if we didn’t solve all of AF. And MIRI houses a lot of different views, including (AFAIK) quite different views on the tractability of AF, the expected usefulness of AF progress, the ways in which solving AF would likely be useful, etc. This wasn’t a case of ‘everyone at MIRI giving up on AF’, but it was a case of Eliezer and Nate (and some other researchers) deciding this wasn’t where they should put their own time, because it didn’t feel enough to them like ‘the mainline way things end up going well’.
My simplified story is that in 2017-2020, the place Nate and Eliezer put their time was instead the new non-public research directions Benya Fallenstein had started (which coincided with a big push to hire more people to work with Nate/Eliezer/Benya/etc. on this). In late 2020 / early 2021, Nate and Eliezer decided that this research wasn’t going fast enough, and that they should move on to other things (but they still didn’t think AF was the thing). Throughout all of this, other MIRI researchers like Scott Garrabrant (the research lead for our AF work) have continued to chug away on AF / embedded agency work.
This has been quite confusing even to me from the outside.
the ways in which solving AF would likely be useful
Other than the rocket alignment analogy and the general case for deconfusion helping, has anyone ever tried to describe in more concrete (though speculative) detail how AF would help with alignment? I’m not saying it wouldn’t. I just literally want to know if anyone has tried explaining this concretely. I’ve been following for a decade but don’t think I ever saw an attempted explanation.
Example I just made up:
Modern ML is in some sense about passing the buck to gradient-descent-ish processes to find our optimizers for us. This results in very complicated, alien systems that we couldn’t build ourselves, which is a worst-case scenario for interpretability / understandability.
If we better understood how optimization works, we might be able to do less buck-passing / delegate less of AI design to gradient-descent-ish processes.
Developing a better formal model of embedded agents could tell us more about how optimization works in this way, allowing the field to steer toward approaches to AGI design that do less buck-passing and produce less opaque systems.
This improved understandability then lets us do important alignment work like understanding what AGIs (or components of AGIs, etc.) are optimizing for, understanding what topics they’re thinking about and what domains they’re skilled in, understanding (and limiting) how much optimization they put into various problems, and generally constructing good safety-stories for why (given assumptions X, Y, Z) our system will only put cognitive work into things we want it to work on.
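To make the “buck-passing” contrast slightly more tangible, here’s a toy sketch (purely illustrative; the task, numbers, and code are made up for this comment, not drawn from MIRI’s work). It contrasts writing a solution directly with delegating the work to a gradient-descent-ish search that hands back parameters we didn’t choose:

```python
# Toy contrast (illustrative only): writing a solution yourself vs. passing the
# buck to a gradient-descent-ish process that finds the solution for you.

# Option 1: we understand the problem, so we just write the answer down.
def c_to_f_direct(c):
    return c * 9.0 / 5.0 + 32.0

# Option 2: we only supply examples and a parameterized guess; an optimization
# loop fills in the parameters for us.
def fit_linear(examples, steps=50_000, lr=3e-4):
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for c, f in examples:
            err = (w * c + b) - f
            grad_w += 2 * err * c
            grad_b += 2 * err
        w -= lr * grad_w / len(examples)
        b -= lr * grad_b / len(examples)
    return w, b

examples = [(c, c_to_f_direct(c)) for c in range(-40, 101, 5)]
w, b = fit_linear(examples)

# The learned artifact is just numbers the optimizer chose, not numbers we chose.
# Here it's two parameters and trivially interpretable; scale that up to billions
# of parameters and you get the opaque, alien systems described above.
print(c_to_f_direct(20.0), w * 20.0 + b)  # ~68.0 for both
```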
Thanks. This is great!
I hadn’t thought of Embedded Agency as an attempt to understand optimization. I thought it was an attempt to ground optimizers in a formalism that wouldn’t behave wildly once they had to start interacting with themselves. But on second thought it makes sense to consider an optimizer that can’t handle interacting with itself to be a broken or limited optimizer.
I think another missing puzzle piece here is ‘the Embedded Agency agenda isn’t just about embedded agency’.
From my perspective, the Embedded Agency sequence is saying (albeit not super explicitly):
Here’s a giant grab bag of anomalies, limitations, and contradictions in our whole understanding of reasoning, decision-making, self-modeling, environment-modeling, etc.
A common theme in these ways our understanding of intelligence goes on the fritz is embeddedness.
The existence of this common theme (plus various more specific interconnections) suggests it may be useful to think about all these problems in light of each other; and it suggests that these problems might be surprisingly tractable, since a single sufficiently-deep insight into ‘how embedded reasoning works’ might knock down a whole bunch of these obstacles all at once.
The point (in my mind—Scott may disagree) isn’t ‘here’s a bunch of riddles about embeddedness, which we care about because embeddedness is inherently important’; the point is ‘here’s a bunch of riddles about intelligence/optimization/agency/etc., and the fact that they all sort of have embeddedness in common may be a hint about how we can make progress on these problems’.
This is related to the argument made in The Rocket Alignment Problem. The core point of Embedded Agency (again, in my mind, as a non-researcher observing from a distance) isn’t stuff like ‘agents might behave wildly once they get smart enough and start modeling themselves, so we should try to understand reflection so they don’t go haywire’. It’s ‘the fact that our formal models break when we add reflection shows that our models are wrong; if we found a better model that wasn’t so fragile and context-dependent and just-plain-wrong, a bunch of things about alignable AGI might start to look less murky’.
(I think this is oversimplifying, and there are also more direct value-adds of Embedded Agency stuff. But I see those as less core.)
The discussion of Subsystem Alignment in Embedded Agency is I think the part that points most clearly at what I’m talking about:
[...] ML researchers are quite familiar with the phenomenon: it’s easier to write a program which finds a high-performance machine translation system for you than to directly write one yourself.
[...] Problems seem to arise because you try to solve a problem which you don’t yet know how to solve by searching over a large space and hoping “someone” can solve it.
If the source of the issue is the solution of problems by massive search, perhaps we should look for different ways to solve problems. Perhaps we should solve problems by figuring things out. But how do you solve problems which you don’t yet know how to solve other than by trying things?
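As a toy illustration of that “solve it by searching over a large space” move (my own sketch, not part of the quoted text): instead of writing the target function, we enumerate candidate expressions until one happens to fit the examples, then use whatever the search hands back.

```python
# Illustrative only: "solving" a problem by massive search rather than by
# figuring it out. We never write the function; we search for one that fits.
import itertools

examples = [(0, 1), (1, 3), (2, 9), (3, 19), (4, 33)]  # generated by 2*x*x + 1

def candidates():
    # A small space (729 expressions) of the form (a) + (b) * (c).
    terms = ["x", "x*x", "x*x*x"] + [str(k) for k in range(6)]
    for a, b, c in itertools.product(terms, repeat=3):
        yield f"({a}) + ({b}) * ({c})"

def search():
    for expr in candidates():
        f = eval(f"lambda x: {expr}")  # build the candidate program
        if all(f(x) == y for x, y in examples):
            return expr  # return the first thing that happens to fit
    return None

# The search, not us, did the intellectual work; we just inherit its output.
# With real ML the space is billions of continuous parameters rather than a
# few hundred tiny expressions, and the inherited artifact is far harder to
# understand.
print(search())  # finds "(1) + (x*x) * (2)"
```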