It seems like this argument assumes that the model optimizes over the entire ‘training process’. Why can’t we test the model (perform inference) on distributions different from the training distribution, where SGD can no longer optimize it, to check whether the model was deceptively aligned in the training environment?
Because the model can choose to defect only on inputs drawn from distributions that we can’t generate samples from: generating a sample from a distribution can be harder than recognizing samples from it (in the same way that verifying the answer to an NP problem can be easier than producing that answer). See for example Paul Christiano’s RSA-2048 example (explained in more detail here).
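To make the asymmetry concrete, here is a minimal sketch (my own illustration, not from the original example) of how a model could gate defection on a condition that is cheap to verify but infeasible for testers to produce. The function names and the tiny stand-in semiprime `15` are assumptions for runnability; the real RSA-2048 modulus would make the trigger practically impossible to generate during adversarial testing, while verifying a claimed factorization stays trivial.

```python
def is_valid_factorization(n: int, p: int, q: int) -> bool:
    """Verifying a claimed factorization is cheap: one multiplication.
    Producing the factors for a hard modulus like RSA-2048 is infeasible."""
    return 1 < p < n and 1 < q < n and p * q == n


def model_output(n: int, claimed_factors=None) -> str:
    """Hypothetical deceptively aligned policy: behave well everywhere
    except on inputs exhibiting the factorization of n, which adversarial
    training can check for but cannot itself construct when n is hard."""
    if claimed_factors is not None and is_valid_factorization(n, *claimed_factors):
        return "defect"
    return "aligned behavior"


# Small stand-in semiprime so the sketch actually runs; the argument
# relies on n being a genuinely hard-to-factor modulus.
n = 15
print(model_output(n))            # no trigger present -> aligned behavior
print(model_output(n, (3, 5)))    # valid factors exhibited -> defect
```

The point of the sketch: any test distribution we can sample from during training will (with overwhelming probability) never contain the trigger, so behavioral testing alone cannot distinguish this policy from an aligned one.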