Admittedly, I got that from the "Deceptive alignment is <1% likely" post.
Even if you don’t believe that post, Pretraining from Human Preferences shows that instilling alignment with human values as a base goal first (thus outer-aligning the model) before giving it world-modeling capabilities works wonders for alignment and has many benefits compared to RLHF.
Given that it has a low alignment tax, I suspect there’s a 50-70% chance that this plan, or a successor to it, will be adopted for alignment.
Here’s the post:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences