To be clear, I agree with this as a response to what Edouard said—and I think it’s a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don’t think it’s a response to what I’m advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn’t be too surprised that it’s not fully clear).
In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that’s clearly not a safe model to amplify, and probably not even a safe model to train in the first place. That’s because instead of getting a model that actually cares about imitating Bob or anything like that, you probably just got some pseudo-aligned mesa-optimizer with an objective that produces behavior that happens to correlate well with Bob’s.
However, there does exist a purely theoretical construct—what would happen if you actually amplified Bob, not an imitation of Bob—that is very likely to be safe and superhuman (though probably still not fully competitive, but we’ll put that aside for now since it doesn’t seem to be the part you’re most skeptical of). Thus, if you could somehow get a model that was in fact trying to imitate amplified Bob, you might be okay—except that that’s not true, because most types of agents, when given the objective of imitating a safe thing, will end up with a bunch of convergent instrumental goals that break that safety. However, I claim that there are natural types of agents (that is, not too complex on a simplicity prior) that, when given the objective of imitating a safe thing, do so safely. That’s what I mean by my step (1) above (and of course, even if such natural agents exist, there’s still a lot you have to do to make sure you get them—that’s the rest of the steps).
But since you seem most skeptical of (1), maybe I’ll try to lay out my basic case for how I think we can get a theory of simple, safe imitators (including simple imitators with arbitrary levels of optimization power):
All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary (similarly to why an approval-directed agent wouldn’t do this sort of thing—the main problem with approval-directed agents just being that human approval is not a very good thing to optimize for).
Specifying a robust Cartesian boundary is not that hard—you just need a good multi-level world-model, which any powerful agent should have to have anyway.
There are remaining issues related to superrationality, but those can be avoided by having a decision theory that ignores them (e.g. the right sort of CDT variant).
There are also some remaining issues related to tiling, but those can be avoided if the Cartesian boundary is structured in such a way that it excludes other agents (this is exactly the trick that LCDT pulls).
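To make that last trick a bit more concrete, here is a minimal toy sketch in Python of what “cutting the links to other agents” buys you. The causal model, node names, and utilities here are all invented for illustration; this is not LCDT as formally specified, just the structural point that an LCDT-style evaluation holds nodes it models as agents fixed at their defaults, so plans that only pay off by influencing those agents stop looking attractive.

```python
# Toy sketch of the LCDT "cut links to other agents" trick (invented example).
# The world is a tiny causal graph:
#   my_action -> lever -> outcome      (a non-agent mechanism)
#   my_action -> human -> outcome      (a node the agent models as another agent)
# A plain CDT-style agent evaluates an action by propagating it through the whole
# graph. An LCDT-style agent evaluates it with the edges *into agent nodes* cut,
# holding the human's behavior fixed at its default value.

DEFAULT_HUMAN = "unchanged"   # the human's behavior if we don't model our influence

def human(my_action):
    # How the human would actually react to the agent's action.
    return "persuaded" if my_action == "manipulate" else "unchanged"

def lever(my_action):
    return "pulled" if my_action == "pull_lever" else "idle"

def outcome(lever_state, human_state):
    score = 0
    if lever_state == "pulled":
        score += 1          # utility obtained through a non-agent mechanism
    if human_state == "persuaded":
        score += 2          # more utility, but only via influencing the human
    return score

def evaluate(action, cut_agent_links):
    human_state = DEFAULT_HUMAN if cut_agent_links else human(action)
    return outcome(lever(action), human_state)

for name, cut in [("CDT-style", False), ("LCDT-style", True)]:
    best = max(["pull_lever", "manipulate", "do_nothing"],
               key=lambda a: evaluate(a, cut))
    print(name, "agent picks:", best)
# CDT-style picks "manipulate" (it credits itself with changing the human);
# LCDT-style picks "pull_lever" (its influence on the human is cut out of the
# decision, so the only channel it optimizes through is the non-agent one).
```

The point is just the structural one: the only channel through which the toy LCDT-style agent credits itself with influence is the non-agent mechanism, which is the sense in which the boundary “excludes other agents.”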
All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary
I’m confused from several directions here. What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate “an agent that optimizes an objective” are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
No—I’m separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you’re going to get it. My step (1) above, which is what I understand us to be talking about, is just about that first piece: understanding what we’re going to be shooting for when we set up our training process (and then once we know what we’re shooting for we can think about how to set up a training process to actually land there). See “How do we become confident in the safety of a machine learning system?” for understanding this way of thinking about ML systems.
It’s worth pointing out, however, that even when we’re just focusing on that first part, it’s very important that we pay attention to the total complexity that we’re paying in specifying what sort of model we want, since that’s going to determine a lot of how difficult it will be to actually construct a training process that produces such a model. Exactly what sort of complexity we should be paying attention to is a bit unclear, but I think that the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).
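For concreteness, here are two standard ways such a “simplicity prior with a speed cap” is often formalized (a sketch of the kind of object meant, not a claim that either is exactly the right model of neural network inductive biases):

```latex
% Two sketches of a "simplicity prior with a speed cap" over programs p
% (illustrative only), where |p| is program length and t(p) its runtime:
%
% (a) a pure simplicity prior restricted by a hard runtime cap T:
P(p) \;\propto\; 2^{-|p|} \quad \text{over programs } p \text{ with } t(p) \le T
%
% (b) a softer version, Levin's Kt complexity, penalizing length plus log-runtime:
Kt(p) = |p| + \log_2 t(p), \qquad P(p) \;\propto\; 2^{-Kt(p)}
```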
What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute
Broadly speaking, I’d say that a Cartesian boundary is robust if the agent has essentially the same concept of what its action, observation, etc. is regardless of what additional true facts it learns about the world.
The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that’s just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
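To see the flavor of that result, here is a small self-contained Monte Carlo sketch (toy environment and names invented for illustration, observations ignored entirely—this is not Turner’s actual proof): in a two-step world where one first action (“seize” extra compute) strictly enlarges the set of reachable final world states, a random utility over world states makes that action optimal more often than chance, while a random utility over action histories makes it optimal exactly half the time.

```python
import random

# Toy 2-step environment (invented for this sketch).
# Step 0: "wait" leads to a weak position, "seize" grabs extra compute.
# Step 1: from the weak position both actions collapse to the same final world
# state; from the powerful position the two actions reach two distinct states.
# So "seize" strictly enlarges the set of reachable world states.

ACTIONS = ["wait", "seize"]

def final_world_state(a0, a1):
    if a0 == "wait":
        return "w_weak"                                  # one reachable state
    return "w_pow_0" if a1 == "wait" else "w_pow_1"      # two reachable states

WORLD_STATES = ["w_weak", "w_pow_0", "w_pow_1"]
ACTION_HISTORIES = [(a0, a1) for a0 in ACTIONS for a1 in ACTIONS]

def optimal_first_action(plan_values):
    """First action of the highest-value plan under `plan_values` (dict over plans)."""
    best = max(ACTION_HISTORIES, key=lambda plan: plan_values[plan])
    return best[0]

N = 100_000
seize_state_utility = 0
seize_action_utility = 0
for _ in range(N):
    # Utility over final *world states*: a plan is scored by the state it reaches.
    u_state = {s: random.random() for s in WORLD_STATES}
    plan_values_state = {p: u_state[final_world_state(*p)] for p in ACTION_HISTORIES}
    seize_state_utility += optimal_first_action(plan_values_state) == "seize"

    # Utility over *action histories*: the plan itself is scored, world ignored.
    plan_values_action = {p: random.random() for p in ACTION_HISTORIES}
    seize_action_utility += optimal_first_action(plan_values_action) == "seize"

print("P(seize optimal | random state utility)  ~", seize_state_utility / N)   # ~2/3
print("P(seize optimal | random action utility) ~", seize_action_utility / N)  # ~1/2
```

Under random world-state utilities, grabbing compute is convergently useful because it keeps more final states reachable; under random action-history utilities, no action is favored, which is the sense in which instrumental convergence disappears.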
Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn’t quite enough.