the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.
Sure, but the fact that it defects in deployment and not in training is a consequence of distributional shift, specifically the shift from a situation where it can’t break out of the box to a situation where it can.
No, I’m talking about it breaking out during training. The only “shifts” here are:
1) the AI gets smarter
2) (perhaps) the AI covertly influences its external environment (i.e. breaks out of the box a bit).
We can imagine scenarios where it’s only (1) and not (2). I find them a bit more far-fetched, but this is the classic vision of the treacherous turn… the AI makes a plan, and then suddenly executes it to attain DSA (decisive strategic advantage). Once it starts to execute, of course there is distributional shift, but:
A) it is auto-induced distributional shift (see the toy sketch after this list)
B) the developers never decided to deploy
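To make (A) concrete, here is a minimal toy sketch (purely illustrative, not anyone’s actual training setup; the class, policy, and parameter names are all made up for this example) of auto-induced distributional shift: the agent’s own actions during training change the distribution of observations it later sees, with no deployment decision involved.

```python
import random

class ToyEnv:
    """Hypothetical toy environment whose observation distribution depends on
    how much the agent has already influenced it (illustration only)."""

    def __init__(self):
        # Share of observations drawn from the agent-shifted regime.
        self.influenced_fraction = 0.0

    def observe(self):
        # Observations come from a mixture that the agent itself reshapes over time.
        return "influenced" if random.random() < self.influenced_fraction else "baseline"

    def step(self, action):
        # The "influence" action nudges future observations away from the
        # baseline training distribution -- the shift is auto-induced.
        if action == "influence":
            self.influenced_fraction = min(1.0, self.influenced_fraction + 0.01)


def training_run(policy, steps=100):
    env = ToyEnv()
    for _ in range(steps):
        obs = env.observe()
        env.step(policy(obs))
    return env.influenced_fraction


# A policy that occasionally nudges its environment: by the end of training a
# chunk of the observation distribution was created by the agent itself, even
# though the developers never decided to deploy anything.
nudging_policy = lambda obs: "influence" if random.random() < 0.5 else "comply"
print(f"Fraction of shifted observations: {training_run(nudging_policy):.2f}")
```

The only point of the sketch is that the shift in (A) is caused by the agent’s own behaviour during training, in contrast to a developer-induced train/deploy shift.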