In my view, the misaligned behavior is split across the shoggoth and the face in most cases. In the case that Daniel K anticipates, the shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing that plot.
So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other division such as the ones you mention.
One reason is that we already see this empirically in the o1 report.
However, to me, the distinction between your cases 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected.
There could be other cases where this doesn’t happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.
If you give a kid a diary and try to avoid giving them any reason to think you’re reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn’t previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, reading the diary is better than nothing.