I’m frustrated with the meme that “mesa-optimization/pseudo-alignment is a robustness (i.e. out-of-distribution, OOD) problem”. IIUC, this is definitionally true in the mesa-optimization paper, but I think this misses the point.
In particular, this seems to exclude an important (maybe the most important) threat model: the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.
This is exactly how I imagine a treacherous turn by a boxed superintelligent AI agent occurring, for instance. It secretly begins breaking out of the box (e.g. via manipulating humans) and we don’t notice until it’s too late.
“the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.”
Sure, but the fact that it defects in deployment and not in training is a consequence of distributional shift, specifically the shift from a situation where it can’t break out of the box to a situation where it can.
No, I’m talking about it breaking out during training. The only “shifts” here are:
1) the AI gets smarter
2) (perhaps) the AI covertly influences its external environment (i.e. breaks out of the box a bit).
We can imagine scenarios where it’s only (1) and not (2). I find them a bit more far-fetched, but this is the classic vision of the treacherous turn: the AI makes a plan, and then suddenly executes it to attain a decisive strategic advantage (DSA). Once it starts to execute, of course there is distributional shift, but:
A) it is auto-induced distributional shift
B) the developers never decided to deploy it.