Thanks for writing this post, it’s great to see explicit (high-level) stories for how and why deceptive alignment would arise! Some comments/disagreements:
(Note I’m using “AI” instead of “model” to avoid confusing myself between “model” and “world model”, e.g. “the deceptively aligned AI’s world model” instead of “the deceptively-aligned model’s world model”).
Making goals long-term might not be easy
You say
Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the thing that we want it to do to its actions here is just to make its proxies into long term goals
However, this doesn’t necessarily seem all that simple. For the proxy goals to be long-term, the world model and internal optimisation process need to be able to plan over the “long term”, or even have a conception of the “long term” at all; so whether this modification is simple seems to depend heavily on how well the world model and internal optimisation process already capture this.
Conditioning on the world model and internal optimisation process capturing this concept, it’s still not necessarily easy to convert proxies into long-term goals if the proxies are time-dependent in some way, as they might be: if tasks or episodes are of similar lengths, then a proxy like “start wrapping up my attempt at this task to present it to the human” is only useful if it’s conditioned on a time near the end of the episode. My argument here seems much sketchier, but I think this might be because I can’t come up with a good example. It seems like it’s not necessarily the case that “making goals long-term” is easy; that seems to be mostly taken on intuition that I don’t think I share.
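As a toy illustration of the kind of time-dependence I have in mind (all the names and numbers here are made up for the example, not anything from the post):

```python
# A toy, purely illustrative proxy: it only makes sense relative to the episode
# clock, so "just make it long-term" isn't obviously a single small modification.

EPISODE_LENGTH = 100  # assume tasks/episodes of roughly similar length


def presentability(state: dict) -> float:
    # Hypothetical stand-in for "how presentable my attempt at the task is".
    return float(state.get("presentability", 0.0))


def wrap_up_proxy(state: dict, t: int) -> float:
    # "Start wrapping up my attempt at this task to present it to the human" is
    # only a useful heuristic when conditioned on being near the end of the
    # episode; strip out that condition to make it "long-term" and it no longer
    # tracks anything sensible.
    if t >= 0.9 * EPISODE_LENGTH:
        return presentability(state)
    return 0.0


print(wrap_up_proxy({"presentability": 0.7}, t=95))  # 0.7: useful near the end
print(wrap_up_proxy({"presentability": 0.7}, t=10))  # 0.0: inert early on
```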
Relatedly, it seems that conditioning on the capabilities of the world model and internal optimisation process changes the path somewhat, in a way that isn’t captured by your analysis. That is, it might be easier to achieve corrigible or internal alignment with a less capable world model/internal optimisation process (i.e. earlier in training), as those don’t require the world model/internal optimisation process to plan over the longer time horizons, or to have the greater situational awareness, required to still perform well in the deceptive alignment case. Do you think that is the case?
On the overhang from throwing out proxies
In the high path-dependency world, you mention an overhang several times. If I understand correctly, what you’re referring to here is that, as the world model increases in capabilities, it will start modelling things that are useful as internal optimisation targets for maximising the training objective, and then at some point SGD could just throw away the AI’s internal goals (which we see as proxies) and instead point to these parts of the world model as the target, which would result in a large jump in performance on the training objective, as these are much better targets. (This is the description of what would happen in the internally aligned case, but the same mechanism seems present in the other cases, as you mention.)
However, it seems like the main reason the world model would capture these parts of the world is that they’re useful (for maximising the training objective) as internal optimisation targets, so if they’re emerging and improving, it’s likely because there’s already pressure for them to improve from being used as targets. This would mean there wasn’t an overhang of the sort described above.
Another way of phrasing this might be that the internal goals (proxies) the AI has will be part of the world model/in the same ontology/using the same representations, rather than separate (as your story seems to imply?), and hence there won’t be something inside the world model to switch them to that provides a bump in the training objective; or, if there is, the switch will happen smoothly, as the things to switch to only become useful targets as they become better modelled.
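To gesture at the picture I have in mind (hypothetical concept names, not a claim about any real architecture): the internal goal is a weighted combination of concepts that already live in the world model’s ontology, so “switching” targets looks like gradual reweighting rather than a discrete swap.

```python
# Hypothetical sketch: the proxy goal is read off the same representations the
# world model uses, so there is no separate structure for SGD to discretely
# replace with "better" world-model parts.

concept_activations = {
    "human_approval": 0.8,   # how much the world model thinks the human approves
    "task_completed": 0.6,   # how complete the world model thinks the task is
    "reward_signal": 0.3,    # the world model's (still crude) model of the reward
}

# The internal goal as a weighted combination of those concepts.
goal_weights = {"human_approval": 0.5, "task_completed": 0.5, "reward_signal": 0.0}


def internal_goal_value(activations: dict, weights: dict) -> float:
    # As the world model's concept of the reward signal improves, weight can shift
    # onto it gradually; there's no overhang-collapsing moment where a separate
    # proxy gets thrown away and replaced wholesale.
    return sum(weights[c] * activations[c] for c in weights)


print(internal_goal_value(concept_activations, goal_weights))  # 0.7
```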
I think the way this affects the analysis is that, as the AI learns more about its training process, it also learns more about the training objective, and if it’s doing this, it would be very easy for the internal goals to shift to pointing at this understanding of the training objective (since it’s already there). This would result in a higher likelihood of corrigible alignment. Specifically, in the case where the AI has a full understanding of the training process, including a full understanding of the training objective (such that it models all parts of it, and there’s a single pointer that points to all these parts and is hence easily referenced), it seems (a priori) as easy to shift the internal objective proxies to point at this training objective pointer as it is to shift them to long-term goals which then require optimising the training objective pointer instrumentally. In fact, the first might even be simpler, as it may need a less complex/less long-time-horizon-capable world model/internal optimiser than the second.
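To make the comparison concrete (again with entirely hypothetical names, and a stub `WorldModel` standing in for the single pointer to the AI’s full understanding of the training objective), the two “rewirings” I’m comparing look roughly like:

```python
# Sketch of the two candidate modifications SGD could make, assuming the world
# model already exposes a pointer to the training objective and a notion of
# whether the AI is still in training. Entirely hypothetical, not a real API.
from dataclasses import dataclass


@dataclass
class WorldModel:
    in_training: bool

    def training_objective(self, state: dict) -> float:
        # Stand-in for the pointer to the full specification of the training objective.
        return float(state.get("training_objective_value", 0.0))

    def long_run_consequences(self, state: dict) -> dict:
        # Stand-in for long-horizon planning over the state.
        return state


def corrigibly_aligned_objective(wm: WorldModel, state: dict) -> float:
    # Rewire the internal goal to point directly at the training-objective pointer.
    return wm.training_objective(state)


def deceptively_aligned_objective(wm: WorldModel, state: dict, long_term_proxy) -> float:
    # Keep a (now long-term) proxy as the terminal goal; optimise the training
    # objective only instrumentally, which additionally requires situational
    # awareness of whether training is still ongoing and longer-horizon planning.
    if wm.in_training:
        return wm.training_objective(state)
    return long_term_proxy(wm.long_run_consequences(state))
```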
Simplicity Bias Argument
(Building on the ideas above about how the world model could have a pointer to the training objective, and would need this for all model classes considered): In the “slightly more formal analysis” of the simplicity bias in the low-path-dependence world, you decompose the complexity of the model into a sum of its world model complexity, optimisation process complexity and mesa objective complexity, and argue that the mesa objective complexity is the key thing that’s going to vary between the model classes being considered. However, I don’t think this argument holds.
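Writing out the decomposition I’m referring to (my own notation, kept deliberately loose):

```latex
% The decomposition from the "slightly more formal analysis", in loose notation:
% the post's argument treats only the last term as varying across model classes.
\[
  C(\text{AI}) \;\approx\; C(\text{world model}) + C(\text{optimisation process}) + C(\text{mesa objective})
\]
```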
Specifically, for the deceptively aligned AI to perform well during training, its world model needs to contain a full specification of the training objective, plus some pointer P to it, such that the optimisation process can say “we want {mesa_objective}, but we need to do well in training instrumentally, so plan to make the value of pointer P go higher”. Either this makes the deceptively aligned AI’s world model more complex than the world model for the internally aligned AI (which instead has the pointer to the full specification of the training objective outside its world model, in the mesa objective component), or it’s the same complexity as the internally aligned AI’s, in which case the internally aligned AI’s mesa objective can just reference that pointer, and hence is likely the same complexity as the simplest mesa objective for the deceptively aligned AI (which likely also just references a pointer to some attribute of the world that the world model captures).
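In rough bookkeeping terms, under the assumption that both world models must contain the full specification of the training objective together with the pointer P:

```latex
% Both mesa objectives reduce to short pointers into the shared ontology, so the
% totals differ by little, and not obviously in deception's favour.
\begin{align*}
  C(\text{deceptive}) &\approx C(\text{WM incl.\ training objective, } P)
    + C(\text{opt.}) + C(\text{long-term proxy goal}) \\
  C(\text{internal})  &\approx C(\text{WM incl.\ training objective, } P)
    + C(\text{opt.}) + C(\text{``maximise } P\text{''})
\end{align*}
```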
Phrased a different way, I’m imagining that the world model, optimisation process and mesa objective are all in a shared (internal to the model) ontology, and hence the mesa objective specification isn’t going to be a copy of the parts of the world model that are the objective (which would entail also copying all the machinery necessary to actually specify this from observations), but instead just a (weighted combination of) concept(s) in the internal ontology, which will be very simple to specify.
Overall, all these considerations argue that deceptive alignment is less likely than the analysis in this post suggests. It does still seem very possible that deceptive alignment occurs, and I still agree that we need transparency tools to fix these problems, but perhaps I think we’re less underwater than Evan does (to use the terminology from the Conclusion).