In one test that I did, I[1] found that GPT2 XL is better than GPT Neo at repeating a shutdown instruction, because WebText gave it more harmful data that can be utilized during the fine-tuning stage (e.g. retraining it to learn what is good or bad). I think a feature of the alignment solution will tackle the transfer of a robust ethics that holds up even against jailbreaks or simple storytelling requests.
Initially, I thought that integrating harmful data into the training process was ill-advised. However, this experiment’s results have altered my viewpoint. I am now more hopeful that we can channel these adverse outcomes to enrich the AI system’s knowledge base. After all, for true alignment, the AI needs to comprehend the various manifestations of harm, evil, or malevolence to effectively identify and mitigate them.
I’m not sure what you mean by “…will have an insufferable ethics…”? But your footnoted excerpt makes perfect sense to me, and agrees with the results of the paper. And I think adding <harm>…</harm> and <evil>…</evil> tags to appropriate spans in the pretraining data makes this even easier for the model to learn, as well as allowing us, at inference time, to enforce a “don’t generate <evil> or <harm>” rule at the banned-token level.
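To make the banned-token rule concrete, here is a minimal, self-contained sketch. The token ids are toy values; in a real setup they would come from the tokenizer after the tags were added to the vocabulary, and a library feature such as Hugging Face’s `bad_words_ids` could do the masking instead of hand-rolled code.

```python
import math

# Toy ids standing in for the ids assigned to "<harm>" and "<evil>".
BANNED_IDS = {7, 9}

def mask_banned(logits, banned_ids=BANNED_IDS):
    """Return a copy of `logits` with banned token ids forced to -inf,
    so greedy or sampled decoding can never pick them."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

def greedy_step(logits):
    """Pick the highest-scoring token id after masking the banned ones."""
    masked = mask_banned(logits)
    return max(range(len(masked)), key=lambda i: masked[i])
```

Even when the raw model assigns its highest score to a banned id, decoding falls back to the best allowed token, so the rule is enforced unconditionally rather than relying on the model’s learned behavior.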
I’m not sure what you mean by “…will have an insufferable ethics…”?
I changed it to “robust ethics” for clarity.
About the tagging procedure: if this method can replicate how we humans organise what is good and bad, then yes, I would say it is worth testing at scale.
My analogy actually does not use tags. I envision that each piece of pretraining data should have a “long instruction set” attached on how to use the knowledge contained in it, as this is much closer to how we humans do it in the real world.
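As a rough illustration of this “long instruction set” idea (the field names and separator format here are purely hypothetical, not an established dataset schema), each pretraining document could be paired with guidance on how its knowledge should be used:

```python
# Hypothetical sketch: pairing each pretraining document with an
# instruction set describing how the knowledge in it should be used.

def make_record(text, instructions):
    """Bundle raw document text with usage guidance."""
    return {"content": text, "instructions": instructions}

def to_training_text(record):
    """Flatten a record into a single training string so the model sees
    the usage instructions alongside the raw content."""
    return (
        "### INSTRUCTIONS ###\n"
        + record["instructions"]
        + "\n### CONTENT ###\n"
        + record["content"]
    )
```

During pretraining the flattened string would be tokenized like any other document, so the model repeatedly sees knowledge and its intended use side by side.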
I am actually open to the tags idea; if someone can demonstrate it from the pre-training stage, creating at least a 7B model, that would be awesome just to see how it works.
Maybe I’m missing something, but based on the architecture they used, it’s not what I am envisioning as a great experiment: the tests they did focused only on the 124-million-parameter GPT2 small. So this is different from what I am proposing, which is a test on at least a 7B model.
As mentioned earlier, I am fine with all sorts of different experimental builds. I am just speculating about what a better experimental build could be if I had a magic wand or enough resources, and a 7-billion-parameter model (at minimum) is a great model to test, especially since we also need to test for generalizability after the toxicity evals.[1]
But I think we both agree that, until someone can provide a sound argument for why eliminating all bad/harmful/destructive data is a way for the AI to defend itself from jailbreaks, attacks, or manipulation, pretraining on a combination of safe and harmful data is still the ideal setup.[2][3]
[1] Conclusion of the post Relevance of ‘Harmful Intelligence’ Data in Training Datasets (WebText vs. Pile).
“My analogy actually is not using tags…”
No, the tags are from a related alignment technique I’m hopeful about.
“…if someone can demonstrate it from pre-training stage…”
Check out the paper I linked to in my original comment.
The AI should still be “useful”, i.e., it should still generalize, after the alignment pre-training or fine-tuning method is performed.
Additionally, an experiment on whether we can capture both perspectives, safe and harmful data, via conditional training, tags, or instructional tags.
Or, after pretraining, manage the good and bad data via fine-tuning (e.g. Reinforcement Learning from Framework Continuums; note: I wrote this).
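The conditional-training idea mentioned in these footnotes could be sketched roughly like this. The keyword-based classifier below is a stand-in for a real toxicity model, and the tag names simply follow the earlier discussion:

```python
# Hypothetical sketch: wrapping harmful spans of pretraining text in
# <harm>...</harm> tags so the model learns an explicit boundary
# between safe and harmful content.

def looks_harmful(span):
    """Stand-in for a real toxicity classifier."""
    return "attack" in span.lower()

def tag_spans(spans):
    """Wrap each harmful span in <harm> tags; leave safe spans untouched."""
    out = []
    for s in spans:
        out.append(f"<harm>{s}</harm>" if looks_harmful(s) else s)
    return " ".join(out)
```

Run over a corpus before pretraining, this would produce data in which both perspectives are present but the harmful side is always explicitly marked.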