In any case, I’d be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.
Thinking over this question myself, I think I’ve found a reasonable answer. Still interested in your thoughts but I’ll write down mine:
It seems like evolution “wanted” us to be (in part) reward-correlate maximizers (i.e., being a reward-correlate maximizer was adaptive in our ancestral environment), and “implemented” this by having our brains internally do “heavy RL” throughout our lives. So we become reward-correlate-maximizing agents early in life, and then when our parents do something like RLHF on top of that, we become schemers pretty easily.
So the important difference is that with “pretraining + light RLHF” there’s no “heavy RL” step.