Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.”
Likely TAI training scenarios include information about the base objective in the model's input, so a corrigibly aligned model could learn to infer the base objective and optimize for that.
However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned, at least as long as it also meets the other conditions for deceptive alignment. And once a mesa-optimizer becomes deceptive, most of the incentive for corrigible alignment disappears, because deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.
Because we will likely start with something that includes a pre-trained language model, the training process will almost certainly include a direct description of the base goal in the model's input. It would be surprising for a model to develop all of the prerequisites of deceptive alignment before it infers the clearly described base goal and learns to optimize for it; the key concepts should already exist from pre-training.
To become deceptively aligned, a model needs situational awareness, a long-term goal, a way to tell whether it is in training, and a way to identify a base goal that differs from its internal goal. To become corrigibly aligned, all the model has to do is infer the training objective and point at it. The latter scenario seems much more likely.