Maybe this 30% is supposed to include stuff other than light post-training? Or maybe the distinction between coherent and non-coherent deceptive alignment is important?
This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you’re doing more stuff on top, so not just pre-training.
Do you have a citation for “I thought scheming is 1% likely with pretrained models”?
FWIW, I disagree with “1% likely for pretrained models”: if scaling pure pretraining (with no important capability improvements from post-training, and without heavy use of CoT reasoning with elaborate scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, then deceptive alignment seems plausible even during pretraining (I don’t know exactly, but maybe 5%).
Yeah, I agree 1% is probably too low. I gave ~5% in my talk on this, and I think I stand by that number; I’ll edit my comment to say 5% instead.
I have a talk I gave after our Sleeper Agents paper where I put it at 5–10%, which I think is still pretty much my current well-considered view.