I’ve been planning for a while to run a similar experiment that adds documents showing examples of AIs behaving in corrigible ways (inspired by talking with Max about Corrigibility as Singular Target).
I think including examples of honest and aligned CoT that results in successful task completion is also a good idea.
Want to collaborate on this experiment idea you have? I have time, and can do the implementation work while you mostly instruct/mentor me.