All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary
I’m confused from several directions here. What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate “an agent that optimizes an objective” are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
No. I’m separating two very important pieces that go into training a machine learning model: what sort of model you want to get, and how you’re going to get it. My step (1) above, which I take to be what we’re discussing, is only about the first piece: understanding what we’re shooting for when we set up our training process (once we know that, we can think about how to set up a training process that actually lands there). See “How do we become confident in the safety of a machine learning system?” for more on this way of thinking about ML systems.
It’s worth pointing out, however, that even when we focus only on that first part, we need to pay attention to the total complexity we’re paying to specify what sort of model we want, since that largely determines how difficult it will be to construct a training process that produces such a model. Exactly what sort of complexity matters is a bit unclear, but I think the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).
What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute
Broadly speaking, I’d say that a Cartesian boundary is robust if the agent has essentially the same concept of what its actions, observations, etc. are, regardless of what additional true facts it learns about the world.
The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that’s just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
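To make the intuition concrete, here is a minimal sketch of my own; it is not taken from the linked proof. It assumes a hypothetical two-step deterministic environment, so each action sequence fixes the observations and I treat action sequences as the histories. The first action goes either to a "hub", from which three distinct final states are reachable, or to a "dead end", from which every second action reaches the same final state. Rewards are sampled uniformly at random, either over final states or over histories, and the code estimates how often the optimal plan starts by going to the hub. All names (`hub`, `dead_end`, `hub_optimal_fraction`) are illustrative assumptions.

```python
import random

# Hypothetical toy environment (an assumption for illustration only).
FIRST_ACTIONS = ["hub", "dead_end"]
SECOND_ACTIONS = ["a", "b", "c"]


def final_state(first, second):
    # From the hub, each second action reaches a different final state;
    # from the dead end, every second action reaches the same final state.
    return f"hub_{second}" if first == "hub" else "dead_end_final"


# All six two-step action sequences, and the four distinct final states.
HISTORIES = [(f, s) for f in FIRST_ACTIONS for s in SECOND_ACTIONS]
FINAL_STATES = sorted({final_state(f, s) for f, s in HISTORIES})


def hub_optimal_fraction(reward_over, n_samples=20000, seed=0):
    """Estimate how often the optimal plan starts with 'hub' when rewards
    are drawn i.i.d. uniform over final states or over histories."""
    rng = random.Random(seed)
    hub_wins = 0
    for _ in range(n_samples):
        if reward_over == "states":
            reward = {s: rng.random() for s in FINAL_STATES}
            best = max(HISTORIES, key=lambda h: reward[final_state(*h)])
        else:
            reward = {h: rng.random() for h in HISTORIES}
            best = max(HISTORIES, key=lambda h: reward[h])
        hub_wins += best[0] == "hub"
    return hub_wins / n_samples


if __name__ == "__main__":
    print("reward over states:   ", hub_optimal_fraction("states"))
    print("reward over histories:", hub_optimal_fraction("histories"))
```

In this toy setup, three of the four final states sit behind the hub, so under state-based rewards the hub should be optimal about three quarters of the time. Under history-based rewards the six histories are split evenly between the two first actions, so the fraction should be about one half: the option-preserving action is no longer favored. This only illustrates the flavor of the result under my assumptions, not the general proof.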
Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn’t quite enough.