On my model, one of the most central technical challenges of alignment—and one that every viable alignment plan will probably need to grapple with—is the issue that capabilities generalize better than alignment.
Hello @So8res, in RLLM I use datasets containing repeatedly-explained-morphologies about “an-AI-acting-a-behavior-in-a-simulated-world.” I then re-trained GPT2XL to “observe” these repeatedly-explained-morphologies and saw promising results. I think this process of observing repeatedly-explained-morphologies is very similar to how a language model acquires biases during pre-training, and that a language model that is capable enough will acquire an understanding of the values (including the simulated world).
Going back to modifying GPT2XL, I saw some evidence that the re-trained GPT2XL can score better than foundation models on a ToM task (capabilities) and against jailbreak attacks (alignment) (ToM, JBs 1, 2, 3). I would like to hear your thoughts on this approach: is this a good attempt, in your book, at solving the hard bit challenge, that capabilities generalize better than alignment? Thank you for your time reading this.