Pre-trained models could conceivably have goals like predicting the next token, but those goals should be extremely myopic and come without situational awareness. In pre-training, a text model predicts each token independently of the others, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, that prediction is used to update the model, and otherwise it doesn’t directly affect anything. A goal aimed at anything beyond the next prediction could only hurt training performance, so such a goal should not emerge. The one exception would be a model that is already deceptively aligned, but since this is a discussion of how deceptive alignment might emerge, we are assuming the model isn’t (yet) deceptively aligned.
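To make the mechanism concrete, here is a minimal sketch of the standard next-token cross-entropy objective in PyTorch. The `TinyLM` model, its sizes, and the random toy data are all hypothetical choices for illustration, not anyone’s actual setup; the point is only that each position’s loss depends solely on the prediction for the immediately following token, and the prediction’s only downstream effect is the gradient update it produces.

```python
import torch
import torch.nn as nn

# Toy next-token prediction setup (illustrative only).
vocab_size, d_model = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # One logit vector per position, scoring the following token.
        return self.head(self.embed(tokens))

model = TinyLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 16))  # a toy "document"

logits = model(tokens[:, :-1])   # predictions for each next token
targets = tokens[:, 1:]          # the tokens that actually come next

# Each position contributes an independent cross-entropy term: the output
# at position t is graded only on token t+1, never on anything further out.
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

loss.backward()      # the prediction's only effect: a gradient...
optimizer.step()     # ...which updates the weights. Nothing else happens.
optimizer.zero_grad()
```

Nothing in this loop rewards the model for affecting the world beyond the next-token loss, which is the sense in which the training objective is myopic.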
I expect pre-training to create something like a myopic prediction goal. Accomplishing that goal effectively would require sophisticated world modeling, but there would be no mechanism for the model to learn to optimize for a real-world goal. When training switches to reinforcement learning, the model will not be deceptively aligned, so it will not resist having its goals modified, and its goals will therefore evolve with the new training signal. The goals acquired in pre-training won’t be dangerous, and they should shift once the model switches to reinforcement learning.
Such a model would understand consequentialism without having a consequentialist goal, just as non-consequentialist humans do.