UriKatz comments on Deceptive Alignment is <1% Likely by Default

UriKatz 21 Mar 2023 2:37 UTC
3 points
2
“Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction.”
Why is it impossible for our model, which is pre-trained on the whole internet, to pick up consequentialism and maximization, especially when it is already picking up non-consequentialist ethics and developing a “nuanced understanding” and “some understanding of direction following … without any reinforcement learning”? Why is it not possible to gain goal-directedness from pre-training on the whole internet, thereby learning it before the base goal is conceptualized/understood? For that matter, why can’t the model pick up goal-directedness and a proxy goal at this stage? To complicate matters further, couldn’t it pick up goal-directedness and a proxy goal without picking up consequentialism and maximization?
- DavidW 21 Mar 2023 20:09 UTC
  1 point
  −1
  Pre-trained models could conceivably have goals like predicting the next token, but they should be extremely myopic and not have situational awareness. In pre-training, a text model predicts tokens totally independently of each other, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, then that prediction is used to update the model. Otherwise, it doesn’t directly affect anything. Having a goal for something external to its next prediction could only be harmful for training performance, so it should not emerge. The one exception would be if it were already deceptively aligned, but this is a discussion of how deceptive alignment might emerge, so we are assuming that the model isn’t (yet) deceptively aligned.
  I expect pre-training to create something like a myopic prediction goal. Accomplishing this goal effectively would require sophisticated world modeling, but there would be no mechanism for the model to learn to optimize for a real-world goal. When the training mechanism switches to reinforcement learning, the model will not be deceptively aligned, and its goals will therefore evolve. The goals acquired in pre-training won’t be dangerous and should shift when the model switches to reinforcement learning.
  This model would understand consequentialism, as do non-consequentialist humans, without having a consequentialist goal.
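
The claim in the reply above that, in pre-training, nothing other than performance on the next token depends directly on the model's output can be made concrete with a toy sketch of the standard teacher-forced next-token objective. This is only an illustration under stated assumptions: `next_token_loss`, `uniform_model`, and `confident_at_first_step` are hypothetical names, the "models" are hand-written functions rather than trained networks, and the sketch shows only the structure of the loss, not what internal representations a real network might learn.

```python
import math

def next_token_loss(predict, tokens):
    """Per-position losses -log p(tokens[t] | tokens[:t]) under teacher forcing."""
    losses = []
    for t in range(1, len(tokens)):
        # The context is always the ground-truth prefix, never the model's own output.
        context = tokens[:t]
        probs = predict(context)  # dict mapping token -> probability
        losses.append(-math.log(probs[tokens[t]]))
    return losses

def uniform_model(context):
    # Hypothetical baseline: ignores the context and predicts uniformly.
    vocab = ["a", "b", "c", "d"]
    return {tok: 1.0 / len(vocab) for tok in vocab}

def confident_at_first_step(context):
    # Hypothetical variant: differs from the baseline only at the first prediction.
    if len(context) == 1:
        return {"a": 0.01, "b": 0.97, "c": 0.01, "d": 0.01}
    return uniform_model(context)

tokens = ["a", "b", "c", "d"]
base = next_token_loss(uniform_model, tokens)
changed = next_token_loss(confident_at_first_step, tokens)

# Changing the first prediction changes only the first loss term.
print(changed[0] < base[0], changed[1:] == base[1:])
```

Because later contexts are built from the training text rather than from the model's predictions, a different prediction at one position leaves every other loss term unchanged in this sketch.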