I feel like this post should be a chapter in a textbook or something. Thank you. Maybe soon someone will actually compile a textbook (or a sequence that serves the same purpose) and this post will be in it.
For example, if a model knew that it was in its last training episode, it would clearly want to optimize for its proxy objective; there would be no particular reason to deceive the training process into thinking it was aligned. Myopic models always behave as if they were in their last training episode.
I don’t follow this. If it knew it was in its last training episode, wouldn’t it think, “I just have to deceive them into thinking I’m aligned one more time, and then I’m free to defect”?
Yep. Meant to say “if a model knew that it was in its last training episode and it wasn’t going to be deployed.” Should be fixed.