Now, it’s true that efficiently estimating that conditional using a single forward pass of a transformer might sometimes involve approximations to beam search.
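For concreteness, here’s a toy sketch of the kind of explicit lookahead a forward pass might be approximating — assuming a hypothetical `log_prob_fn(prefix)` that returns next-token log-probabilities; the names and interface are illustrative, not anything from the post:

```python
import heapq

def beam_search(log_prob_fn, start_token, beam_width=4, depth=8):
    # Explicit lookahead: keep the `beam_width` highest-scoring partial
    # sequences, extend each by one token, and repeat for `depth` steps.
    beams = [(0.0, [start_token])]  # (cumulative log-prob, token sequence)
    for _ in range(depth):
        candidates = []
        for score, seq in beams:
            # `log_prob_fn(seq)` is assumed to return {next_token: log_prob}
            for token, lp in log_prob_fn(seq).items():
                candidates.append((score + lp, seq + [token]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams  # the best multi-step continuations found by the search
```

The worry, as I read it, is that a network could learn to do something functionally like this internally rather than as a separate decoding loop.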
Yeah, that’s the possibility the post explores.
At a high level, I don’t think we really need to be concerned with this form of “internal lookahead” unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).
Is there an easy way to detect if it’s started doing that / tell it to restrict its lookahead to particular domains? If not, it may be easier to just prevent it from mesa-optimizing in the first place. (The post has arguments for why that’s (a) possible and (b) wouldn’t necessarily involve a big performance penalty.)
My intuitions on this matter are:
1) Stopping mesa-optimizing completely seems mad hard.
2) Managing “incentives” is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.
3) On the other hand, it probably won’t scale forever.
To elaborate on the incentive management thing… if we figure that stuff out and do it right and it has the promise that I think it does… then it won’t restrict lookahead to particular domains, but it will remove incentives for instrumental goal seeking.
If we’re still in a situation where the AI doesn’t understand its physical environment and isn’t incentivized to learn to control it, then we can do simple things like use a fixed dataset (as opposed to data we’re collecting online) in order to make it harder for the AI to learn anything significant about its physical environment.
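As a toy sketch of that distinction — with hypothetical `model_update` and `query_environment` callables standing in for the real training loop, none of this from the post:

```python
def train_offline(model_update, fixed_dataset, epochs=10):
    # Everything the model can learn about the world is bounded by
    # whatever this frozen snapshot of data happens to contain.
    for _ in range(epochs):
        for example in fixed_dataset:
            model_update(example)

def train_online(model_update, query_environment, steps=10_000):
    # Each step pulls fresh observations from the (physical) environment,
    # which is exactly the feedback channel this comment suggests closing off.
    for _ in range(steps):
        model_update(query_environment())
```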
Learning about the physical environment and using it to improve performance is not necessarily bad/scary absent incentives for control. However, I worry that having a good world model makes an AI much more liable to infer that it should try to control and not just predict the world.
As I mentioned in the post, I don’t think this is a binary, and stopping mesa-optimization “incompletely” seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn’t seem mad hard to me.
“Managing ‘incentives’ is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.”
I’m less optimistic about this approach.
There is a stochastic aspect to training ML models, so it’s not enough to say “the incentives favor Mesa-Optimizing for X over Mesa-Optimizing for Y”. If Mesa-Optimizing for Y is nearby in model-space, we’re liable to stumble across it.
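A toy illustration of that stochasticity (a tiny logistic regression, not a claim about mesa-optimizers): identical data and objective, but different seeds land at different solutions in parameter space.

```python
import numpy as np

def train(seed, X, y, steps=2000, lr=0.1):
    # Same data, same objective; only the initialisation and the
    # order of examples depend on the seed.
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(X))
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (p - y[i]) * X[i]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)

w_a, w_b = train(1, X, y), train(2, X, y)
for name, w in [("run A", w_a), ("run B", w_b)]:
    print(name, "accuracy:", np.mean((X @ w > 0) == (y == 1)))
print("parameter distance:", np.linalg.norm(w_a - w_b))  # similar fit, different weights
```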
Even if your mesa-optimizer is aligned, if it doesn’t have a way to stop mesa-optimization, there’s the possibility that your mesa-optimizer would develop another mesa-optimizer inside itself which isn’t necessarily aligned.
I’m picturing value learning via (un)supervised learning, and I don’t see an easy way to control the incentives of any mesa-optimizer that develops in the context of (un)supervised learning. (Curious to hear about your ideas though.)
My intuition is that the distance between Mesa-Optimizing for X and Mesa-Optimizing for Y is likely to be smaller than the distance between an Incompetent Mesa-Optimizer and a Competent Mesa-Optimizer. If you’re shooting for a Competent Human Values Mesa-Optimizer, it would be easy to stumble across a Competent Not Quite Human Values Mesa-Optimizer along the way. All it would take would be having the “Competent” part in place before the “Human Values” part. And running a Competent Not Quite Human Values Mesa-Optimizer during training is likely to be dangerous.
On the other hand, if we have methods for detecting mesa-optimization or starving it of compute that work reasonably well, we’re liable to stumble across an Incompetent Mesa-Optimizer and run it a few times, but it’s less likely that we’ll hit the smaller target of a Competent Mesa-Optimizer.
By managing incentives I expect we can, in practice, do things like “[telling it to] restrict its lookahead to particular domains”, or removing any incentive for control of the environment.
I think we’re talking past each other a bit here.