As I mentioned in the post, I don’t think this is a binary, and stopping mesa-optimization “incompletely” seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn’t seem mad hard to me.
Managing “incentives” is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.
I’m less optimistic about this approach.
There is a stochastic aspect to training ML models, so it’s not enough to say “the incentives favor Mesa-Optimizing for X over Mesa-Optimizing for Y”. If Mesa-Optimizing for Y is nearby in model-space, we’re liable to stumble across it.
Even if your mesa-optimizer is aligned, unless it has a way to stop mesa-optimization, it could develop another mesa-optimizer inside itself, and that one isn’t necessarily aligned.
I’m picturing value learning via (un)supervised learning, and I don’t see an easy way to control the incentives of any mesa-optimizer that develops in the context of (un)supervised learning. (Curious to hear about your ideas though.)
My intuition is that the distance between Mesa-Optimizing for X and Mesa-Optimizing for Y is likely to be smaller than the distance between an Incompetent Mesa-Optimizer and a Competent Mesa-Optimizer. If you’re shooting for a Competent Human Values Mesa-Optimizer, it would be easy to stumble across a Competent Not Quite Human Values Mesa-Optimizer along the way. All it would take is for the “Competent” part to be in place before the “Human Values” part. And running a Competent Not Quite Human Values Mesa-Optimizer during training is likely to be dangerous.
On the other hand, if we have methods for detecting mesa-optimization or starving it of compute that work reasonably well, we’re liable to stumble across an Incompetent Mesa-Optimizer and run it a few times, but it’s less likely that we’ll hit the smaller target of a Competent Mesa-Optimizer.
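To make the shape of this argument concrete, here is a toy Monte Carlo sketch. It is only an illustration under made-up assumptions: the two-axis "model space", the drift and noise values, the thresholds, and the per-step `detect_prob` are all hypothetical, and the conclusion is largely baked into those choices. It compares how often a noisy training run ever executes a competent but not-quite-aligned model, with and without an imperfect detector that halts training once any mesa-optimization appears.

```python
import random

def run_training(rng, detect_prob, steps=200):
    """Noisy walk through a toy 2-D 'model space'.

    Returns True if a competent-but-misaligned model is ever run.
    All numbers below are hypothetical, chosen only for illustration.
    """
    competence, alignment = 0.0, 0.0
    for _ in range(steps):
        # Stochastic training step: small drift plus Gaussian noise on each axis.
        competence += 0.02 + rng.gauss(0, 0.05)
        alignment += 0.02 + rng.gauss(0, 0.05)

        is_mesa_optimizer = competence >= 0.3  # even an incompetent one counts
        is_competent = competence >= 0.8
        is_aligned = alignment >= 0.9

        if is_competent and not is_aligned:
            return True   # ran a Competent Not Quite Human Values Mesa-Optimizer
        if is_mesa_optimizer and rng.random() < detect_prob:
            return False  # imperfect detector fired, training halted
    return False

def danger_rate(detect_prob, trials=2000, seed=0):
    """Fraction of simulated training runs that hit the dangerous region."""
    rng = random.Random(seed)
    hits = sum(run_training(rng, detect_prob) for _ in range(trials))
    return hits / trials

no_detection = danger_rate(detect_prob=0.0)
with_detection = danger_rate(detect_prob=0.3)
print(f"danger rate without detection: {no_detection:.3f}")
print(f"danger rate with detection:    {with_detection:.3f}")
```

In this toy setup the detector only catches a mesa-optimizer 30% of the time on any given step, but a run has to spend many steps as an Incompetent Mesa-Optimizer before it becomes competent, so it rarely slips through. Real training dynamics need not look anything like this random walk.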
By managing incentives I expect we can, in practice, do things like: “[telling it to] restrict its lookahead to particular domains”… or remove any incentive for control of the environment.
I think we’re talking past each other a bit here.