Example I just made up:
Modern ML is in some sense about passing the buck to gradient-descent-ish processes to find our optimizers for us. This results in very complicated, alien systems that we couldn’t build ourselves, which is a worst-case scenario for interpretability / understandability.
If we better understood how optimization works, we might be able to do less buck-passing / delegate less of AI design to gradient-descent-ish processes.
Developing a better formal model of embedded agents could tell us more about how optimization works in this way, allowing the field to steer toward approaches to AGI design that do less buck-passing and produce less opaque systems.
This improved understandability then lets us do important alignment work like understanding what AGIs (or components of AGIs, etc.) are optimizing for, understanding what topics they’re thinking about and what domains they’re skilled in, understanding (and limiting) how much optimization they put into various problems, and generally constructing good safety-stories for why (given assumptions X, Y, Z) our system will only put cognitive work into things we want it to work on.
Thanks. This is great!
I hadn’t thought of Embedded Agency as an attempt to understand optimization. I thought it was an attempt to ground optimizers in a formalism in which they wouldn’t behave wildly once they had to start interacting with themselves. But on second thought, it makes sense to consider an optimizer that can’t handle interacting with itself to be a broken or limited optimizer.
I think another missing puzzle piece here is ‘the Embedded Agency agenda isn’t just about embedded agency’.
From my perspective, the Embedded Agency sequence is saying (albeit not super explicitly):
Here’s a giant grab bag of anomalies, limitations, and contradictions in our whole understanding of reasoning, decision-making, self-modeling, environment-modeling, etc.
A common theme running through these ways our understanding of intelligence goes on the fritz is embeddedness.
The existence of this common theme (plus various more specific interconnections) suggests it may be useful to think about all these problems in light of each other; and it suggests that these problems might be surprisingly tractable, since a single sufficiently-deep insight into ‘how embedded reasoning works’ might knock down a whole bunch of these obstacles all at once.
The point (in my mind—Scott may disagree) isn’t ‘here’s a bunch of riddles about embeddedness, which we care about because embeddedness is inherently important’; the point is ‘here’s a bunch of riddles about intelligence/optimization/agency/etc., and the fact that they all sort of have embeddedness in common may be a hint about how we can make progress on these problems’.
This is related to the argument made in The Rocket Alignment Problem. The core point of Embedded Agency (again, in my mind, as a non-researcher observing from a distance) isn’t stuff like ‘agents might behave wildly once they get smart enough and start modeling themselves, so we should try to understand reflection so they don’t go haywire’. It’s ‘the fact that our formal models break when we add reflection shows that our models are wrong; if we found a better model that wasn’t so fragile and context-dependent and just-plain-wrong, a bunch of things about alignable AGI might start to look less murky’.
(I think this is oversimplifying, and there are also more direct value-adds of Embedded Agency stuff. But I see those as less core.)
The discussion of Subsystem Alignment in Embedded Agency is, I think, the part that points most clearly at what I’m talking about:
[...] ML researchers are quite familiar with the phenomenon: it’s easier to write a program which finds a high-performance machine translation system for you than to directly write one yourself.
[...] Problems seem to arise because you try to solve a problem which you don’t yet know how to solve by searching over a large space and hoping “someone” can solve it.
If the source of the issue is the solution of problems by massive search, perhaps we should look for different ways to solve problems. Perhaps we should solve problems by figuring things out. But how do you solve problems which you don’t yet know how to solve other than by trying things?
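To make that contrast concrete, here’s a minimal toy sketch (added purely for illustration; the toy problem, numbers, and hyperparameters are all assumptions, not anything from the sequence or this thread): the same trivial mapping, first written down directly because we understand it, then recovered by a gradient-descent-ish search over parameters.

```python
# Toy contrast: "figuring it out" vs. "passing the buck" to a parameter search.
import numpy as np

# 1. Direct approach: we understand the problem, so we just write the answer down.
def f_to_c(f):
    return (f - 32.0) * 5.0 / 9.0  # every step is legible and auditable

# 2. Buck-passing approach: let gradient descent find parameters w, b for us.
rng = np.random.default_rng(0)
xs = rng.uniform(-40.0, 120.0, size=256)  # Fahrenheit samples (arbitrary range)
ys = f_to_c(xs)                           # supervision signal
w, b = 0.0, 0.0
lr, steps = 2e-4, 200_000                 # arbitrary toy hyperparameters
for _ in range(steps):
    err = (w * xs + b) - ys
    w -= lr * 2.0 * np.mean(err * xs)     # gradient of mean squared error wrt w
    b -= lr * 2.0 * np.mean(err)          # gradient of mean squared error wrt b

print(w, b)  # converges to roughly 5/9 and -160/9
```

The search ends up with the same mapping as the hand-written rule, but the knowledge now lives in fitted numbers that the search found for us rather than in a derivation we did ourselves; that is the sense in which solving problems by massive search ‘passes the buck’.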