No, I’m talking about it breaking out during training. The only “shifts” here are:
1) the AI gets smarter
2) (perhaps) the AI covertly influences its external environment (i.e. breaks out of the box a bit).
We can imagine scenarios where it’s only (1) and not (2). I find them a bit more far-fetched, but this is the classic vision of the treacherous turn: the AI makes a plan, and then suddenly executes it to attain a decisive strategic advantage (DSA). Once it starts to execute, there is of course a distributional shift, but:
A) it is an auto-induced distributional shift (the shift is caused by the AI’s own actions), and
B) the developers never decided to deploy