the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.
Sure, but the fact that it defects in deployment and not in training is a consequence of distributional shift, specifically the shift from a situation where it can’t break out of the box to a situation where it can.
No, I’m talking about it breaking out during training. The only “shifts” here are:
1) the AI gets smarter
2) (perhaps) the AI covertly influences its external environment (i.e. breaks out of the box a bit).
We can imagine scenarios where it’s only (1) and not (2). I find them a bit more far-fetched, but this is the classic vision of the treacherous turn… the AI makes a plan, and then suddenly executes it to attain DSA (decisive strategic advantage). Once it starts to execute, of course there is distributional shift, but:
A) it is auto-induced distributional shift (see the toy sketch after this list)
B) the developers never decided to deploy
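To make (A) concrete, here is a minimal toy sketch (purely illustrative, not anyone’s actual training setup; the class, policy, and parameter names are all made up for this example) of auto-induced distributional shift: the agent’s own actions during training change the distribution of observations it later sees, with no deployment decision involved.

```python
import random

class ToyEnv:
    """Hypothetical toy environment whose observation distribution depends on
    how much the agent has already influenced it (illustration only)."""

    def __init__(self):
        # Share of observations drawn from the agent-shifted regime.
        self.influenced_fraction = 0.0

    def observe(self):
        # Observations come from a mixture that the agent itself reshapes over time.
        return "influenced" if random.random() < self.influenced_fraction else "baseline"

    def step(self, action):
        # The "influence" action nudges future observations away from the
        # baseline training distribution -- the shift is auto-induced.
        if action == "influence":
            self.influenced_fraction = min(1.0, self.influenced_fraction + 0.01)


def training_run(policy, steps=100):
    env = ToyEnv()
    for _ in range(steps):
        obs = env.observe()
        env.step(policy(obs))
    return env.influenced_fraction


# A policy that occasionally nudges its environment: by the end of training a
# chunk of the observation distribution was created by the agent itself, even
# though the developers never decided to deploy anything.
nudging_policy = lambda obs: "influence" if random.random() < 0.5 else "comply"
print(f"Fraction of shifted observations: {training_run(nudging_policy):.2f}")
```

The only point of the sketch is that the shift in (A) is caused by the agent’s own behaviour during training, in contrast to a developer-induced train/deploy shift.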