Thanks. This is great!
I hadn’t thought of Embedded Agency as an attempt to understand optimization. I thought it was an attempt to ground optimizers in a formalism that wouldn’t behave wildly once they had to start interacting with themselves. But on second thought it makes sense to consider an optimizer that can’t handle interacting with itself to be a broken or limited optimizer.
I think another missing puzzle piece here is ‘the Embedded Agency agenda isn’t just about embedded agency’.
From my perspective, the Embedded Agency sequence is saying (albeit not super explicitly):
1. Here’s a giant grab bag of anomalies, limitations, and contradictions in our whole understanding of reasoning, decision-making, self-modeling, environment-modeling, etc.
2. A common theme in these ways our understanding of intelligence goes on the fritz is embeddedness.
3. The existence of this common theme (plus various more specific interconnections) suggests it may be useful to think about all these problems in light of each other; and it suggests that these problems might be surprisingly tractable, since a single sufficiently-deep insight into ‘how embedded reasoning works’ might knock down a whole bunch of these obstacles all at once.
The point (in my mind—Scott may disagree) isn’t ‘here’s a bunch of riddles about embeddedness, which we care about because embeddedness is inherently important’; the point is ‘here’s a bunch of riddles about intelligence/optimization/agency/etc., and the fact that they all sort of have embeddedness in common may be a hint about how we can make progress on these problems’.
This is related to the argument made in The Rocket Alignment Problem. The core point of Embedded Agency (again, in my mind, as a non-researcher observing from a distance) isn’t stuff like ‘agents might behave wildly once they get smart enough and start modeling themselves, so we should try to understand reflection so they don’t go haywire’. It’s ‘the fact that our formal models break when we add reflection shows that our models are wrong; if we found a better model that wasn’t so fragile and context-dependent and just-plain-wrong, a bunch of things about alignable AGI might start to look less murky’.
(I think this is oversimplifying, and there are also more direct value-adds of Embedded Agency stuff. But I see those as less core.)
The discussion of Subsystem Alignment in Embedded Agency is, I think, the part that points most clearly at what I’m talking about:
[...] ML researchers are quite familiar with the phenomenon: it’s easier to write a program which finds a high-performance machine translation system for you than to directly write one yourself.
[...] Problems seem to arise because you try to solve a problem which you don’t yet know how to solve by searching over a large space and hoping “someone” can solve it.
If the source of the issue is the solution of problems by massive search, perhaps we should look for different ways to solve problems. Perhaps we should solve problems by figuring things out. But how do you solve problems which you don’t yet know how to solve other than by trying things?