Admittedly, I got that from the "Deceptive alignment is <1% likely" post.
Even if you don’t believe that post, Pretraining from Human Preferences shows that instilling alignment with human values as a base goal first (thus outer-aligning the model) before giving it world-modeling capabilities works wonders for alignment and has many benefits compared to RLHF.
Given that it has a low alignment tax, I suspect there’s a 50-70% chance that this plan, or a successor to it, will be adopted for alignment.
Here’s the post:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences