I feel like this post should be a chapter in a textbook or something. Thank you. Maybe soon someone will actually compile a textbook (or a sequence that serves the same purpose) and this post will be in it.
For example, if a model knew that it was in its last training episode, it would clearly want to optimize for its proxy objective; there would be no particular reason to deceive the training process into thinking it was aligned. Myopic models always behave as if they were in their last training episode.
I don’t follow this. If it knew it was in its last training episode, wouldn’t it think, “I just have to deceive them into thinking I’m aligned one more time, and then I’m free to defect”?
Yep. Meant to say “if a model knew that it was in its last training episode and it wasn’t going to be deployed.” Should be fixed.