There are more paths to deceptive alignment than to robust alignment. Since the future value of its objective depends on the parameter updates, a mesa-optimizer that meets the three criteria for deceptive alignment is likely to have a strong incentive to understand the base objective better. Even a robustly aligned mesa-optimizer that meets the criteria is incentivized to figure out the base objective in order to determine whether or not it will be modified, since before doing so it has no way of knowing its own level of alignment with the base optimizer. Mesa-optimizers that are capable of reasoning about their incentives will, therefore, attempt to get more information about the base objective. Furthermore, once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease, potentially leading to a crystallization of the mesa-objective. However, due to unidentifiability (as discussed in the third post), most mesa-objectives that are aligned on the training data will be pseudo-aligned rather than robustly aligned. Thus, the most likely sort of objective to become crystallized is a pseudo-aligned one, leading to deceptive alignment.
This part is also unintuitive to me; I don't really follow your argument here.
I think where I'm getting lost is the following claim:
Conceptualisation of the base objective removes selection pressure on the mesa-objective.
It seems plausible to me that conceptualisation of the base objective might instead exert selection pressure to shift the mesa-objective towards the base objective, leading to internalisation of the base objective rather than crystallisation of a pseudo-aligned objective and deceptive alignment.
Or rather, that's what I would expect by default?
There's a jump being made here from "the mesa-optimiser learns the base objective" to "the mesa-optimiser becomes deceptively aligned" that is not at all obvious to me.
Why doesn’t conceptualisation of the base objective heavily select for robust alignment?
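For what it's worth, I do follow the unidentifiability point itself. Here is a toy counting sketch of how I understand it (my own illustration, with arbitrary small numbers, not anything from the post):

```python
# Toy illustration of unidentifiability: on a small finite input space,
# count how many candidate objectives agree with the base objective on
# the training inputs alone, versus on every input.
from itertools import product

inputs = list(range(4))             # tiny input space (arbitrary size)
train = inputs[:2]                  # training data only covers half of it
base = {x: x % 2 for x in inputs}   # some fixed base objective (arbitrary)

# Every candidate objective is a function from inputs to {0, 1}.
candidates = [dict(zip(inputs, values)) for values in product([0, 1], repeat=len(inputs))]

aligned_on_train = [c for c in candidates if all(c[x] == base[x] for x in train)]
robustly_aligned = [c for c in candidates if all(c[x] == base[x] for x in inputs)]

print(len(aligned_on_train))  # 4 objectives look aligned on the training data...
print(len(robustly_aligned))  # ...but only 1 of them is aligned everywhere.
```

So I'm not disputing that most training-consistent objectives are pseudo-aligned; what I'm disputing is the step from there to crystallisation and deception.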
Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.
I also very much do not follow this argument, for the reasons given above.
We’ve already established that the mesa-optimiser will eventually conceptualise the base objective.
Once the base objective has been conceptualised, there is a model of the base objective to “point to”.
The claim that conceptualisation of the base objective makes deceptive alignment likely is very much non-obvious. The conditions under which deceptive alignment is supposed to arise seem to me like conditions that would select for internalisation of the base objective or for corrigible alignment.
[Once the mesa-optimiser has conceptualised the base objective, why is deceptive alignment more likely than a retargeting of the mesa-optimiser's internal search process to point towards the base objective? I'd naively expect the parameter updates of the base optimiser to exert selection pressure on the mesa-optimiser towards retargeting its search at the base objective.]
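To make "retargeting" concrete, here is a minimal sketch of the picture I have in mind (purely illustrative; the names, numbers, and search procedure are my own invention, not something from the post):

```python
# Minimal picture of a "retargetable" internal search: the search procedure
# is generic, and the objective it optimises is just a parameter that the
# base optimiser's updates could, in principle, swap out.
def search(objective, candidates):
    """Return the candidate action scoring highest under the given objective."""
    return max(candidates, key=objective)

def mesa_objective(action):            # the proxy the mesa-optimiser currently uses
    return -abs(action - 3)

def model_of_base_objective(action):   # the learned model of the base objective
    return -abs(action - 7)

actions = range(10)
print(search(mesa_objective, actions))            # 3: behaviour under the current proxy
print(search(model_of_base_objective, actions))   # 7: behaviour if the search were retargeted
```

The question, then, is why the updates would favour keeping the proxy and deceiving, rather than simply swapping which objective the existing search machinery is pointed at.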
I do agree that deceptively aligned mesa-optimisers will behave corrigibly with respect to the base objective, but I don't follow the purported mechanisms by which deceptive alignment is supposed to arise in the first place.
All in all, these two paragraphs give me the impression of non-obvious, insufficiently justified claims and somewhat circular reasoning.