I have had this idea for a while. Seems like a good thing to do, looking from a simulators/direct value alignment frame. Might make corrigibility harder depending on exact implementation. Still, I’d expect it to be net-positive.
Invitation for critiques: If nobody convinces me it’s a bad idea in a week’s time from posting, I’ll just proceed to implementation.
Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
The concerns about data filtering raised in that post’s comments[1] suggest doing aligned-CoT-seeding on the pretraining data may be a better thing to try instead.
ex.: Jozdien citing gwern
This is indeed pretty relevant.
(you’ll want to post the text in obscure-to-humans places that won’t get a bunch of confused reactions from humans which would counter the effect)
Yes. Agreed.
I’ve been planning for a while to do a similar experiment with adding documents showing examples of AIs behaving in corrigible ways (inspired by talking with Max about Corrigibility as Singular Target)
I think examples of honest and aligned CoT resulting in successful task completion are also a good idea.
Want to collaborate on this experiment idea you have? I have time, and can do the implementation work while you mostly instruct/mentor me.
I think it might make sense to do it as a research project first? Though you would need to be able to train a model from scratch.
Maybe in isolation, but I get the feeling that time is of the essence.