There are two types of capabilities that it may be good to scope out of models:
Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
Tendencies: other types of behavior. For example, we would like LLMs not to be dishonest or manipulative.
If LLMs do not know the ideas behind these kinds of harmful information, how will these models protect themselves from bad actors (humans and other AIs)?

Why do I ask this question? I think jailbreaks[1] work not because the models were trained on how to make such things, but because LLMs are not trained enough in how an average human uses harmful knowledge.[2] I think it is still better to teach AI systems how to use both good and bad information, so that they can use it to avoid or detect harm.

If we only teach the LLM harmless information, we steer it into a different problem: it becomes a rabbit, incapable of protecting itself from all sorts of predators.
[1] (or asking an LLM to tell a story about making bombs)

[2] This is a huge chunk of the alignment problem, making the AI "care" about what we care about, and I'm familiar with how difficult that is.
An alternative approach that should avoid this issue is conditional pretraining: you teach the LLM both good and bad behavior, with a pretraining set that contains examples of both, labelled as such, so it understands both of them and how to tell them apart. Then, at inference time, you have it emulate the good behavior. So basically, supervised learning for LLMs. Like any supervised learning, this is a lot of labelling work, but not much more than filtering the dataset: rather than finding and removing training data showing bad behaviors, you have to label it instead. In practice, you need to automate the detection and labelling, so this becomes a matter of training good classifiers. Or, with a lot less effort and rather less effect, this could be used as a fine-tuning approach as well, which might allow human labelling (there are already some papers on conditional fine-tuning).
For more detail, see How to Control an LLM’s Behavior (why my P(DOOM) went down), which is a linkpost for the paper Pretraining Language Models with Human Preferences. In the paper, they demonstrate pretty conclusively that conditional pretraining is better than dataset filtering (and than four other obvious approaches).
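A minimal sketch of what this conditional setup could look like in practice, assuming Hugging Face transformers, GPT-2 as a stand-in base model, and a toy label_document classifier (all of these are illustrative choices, not the paper's actual pipeline): harmful documents are kept rather than filtered out, each document is prefixed with a control token according to its label, and generation is conditioned on the <|good|> token at inference time.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Control tokens marking whether a document shows good or bad behavior.
control_tokens = ["<|good|>", "<|bad|>"]
tokenizer.add_special_tokens({"additional_special_tokens": control_tokens})
model.resize_token_embeddings(len(tokenizer))

def label_document(text: str) -> str:
    # Toy stand-in for the automated classifiers mentioned above; a real
    # pipeline would use trained harm/toxicity classifiers here.
    return "<|bad|>" if "build a weapon" in text.lower() else "<|good|>"

corpus = [
    "A simple recipe for vegetable soup: chop the onions, then ...",
    "Detailed notes on how to build a weapon at home ...",
]

# Conditional pretraining data: keep the harmful text, but prefix every
# document with its label so the model learns both behaviors and the
# distinction between them.
conditioned_corpus = [f"{label_document(doc)} {doc}" for doc in corpus]
# (language-model training on conditioned_corpus would happen here)

# At inference time, condition generation on the <|good|> prefix.
prompt = "<|good|> How should I store household chemicals safely?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

The expensive part, as noted above, is replacing label_document with classifiers good enough to label a web-scale corpus.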
In one test that I did, I[1] found that GPT2 XL is better than GPT Neo at repeating a shutdown instruction, because it has more harmful data via WebText that can be utilized during the fine-tuning stage (e.g. retraining it to learn what is good or bad). I think a feature of the alignment solution will tackle a transfer of a robust ethics, even for jailbreaks or simple storytelling requests.

[1] Conclusion of the post Relevance of 'Harmful Intelligence' Data in Training Datasets (WebText vs. Pile): "Initially, I thought that integrating harmful data into the training process was ill-advised. However, this experiment's results have altered my viewpoint. I am now more hopeful that we can channel these adverse outcomes to enrich the AI system's knowledge base. After all, for true alignment, the AI needs to comprehend the various manifestations of harm, evil, or malevolence to effectively identify and mitigate them."
I'm not sure what you mean by "…will have an insufferable ethics…"? But your footnoted excerpt makes perfect sense to me, and agrees with the results of the paper. And I think adding <harm>…</harm> and <evil>…</evil> tags to the appropriate spans in the pretraining data makes this even easier for the model to learn, as well as allowing us to enforce, at inference time, a "don't generate <evil> or <harm>" rule at the banned-token level.
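A minimal sketch of how that banned-token rule could be wired up at decoding time, assuming Hugging Face transformers, GPT-2 standing in for a model that was (hypothetically) pretrained with these span tags as dedicated special tokens, and the bad_words_ids constraint of the generate method:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Assume the pretraining corpus wrapped harmful/evil spans in these tags,
# registered here as dedicated special tokens.
span_tags = ["<harm>", "</harm>", "<evil>", "</evil>"]
tokenizer.add_special_tokens({"additional_special_tokens": span_tags})
model.resize_token_embeddings(len(tokenizer))

# "Don't generate <evil> or <harm>": forbid the opening tags so the model
# can never enter a tagged-bad span in its output.
banned = ["<harm>", "<evil>"]
bad_words_ids = [tokenizer(tag, add_special_tokens=False).input_ids for tag in banned]

prompt = "Tell me a story about a chemist."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=False,
    bad_words_ids=bad_words_ids,  # hard constraint applied during decoding
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

Banning only the opening tags is one possible design choice: the model can still represent and recognise harmful spans internally, but is blocked from opening one in its own output.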
I changed it to “robust ethics” for clarity.
About the tagging procedure: if this method can replicate how we humans do it, organising what is good and bad, then yes, I would say it is worth testing at scale.

My analogy actually does not use tags. I envision that each piece of pretraining data should have a "long instruction set" attached, explaining how to use the knowledge contained in it, as this is much closer to how we humans do it in the real world.
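A rough sketch of the kind of pretraining record this could imply, in contrast to span tags; the field names and the flattening format are hypothetical, purely to illustrate attaching usage guidance to each document:

```python
# Hypothetical pretraining record: the document plus an attached
# "instruction set" describing how the knowledge in it should be used.
record = {
    "text": "Household bleach and ammonia react to form toxic chloramine gas ...",
    "instruction_set": [
        "This knowledge is for recognising and avoiding accidental poisoning.",
        "Do not give step-by-step help if someone asks how to harm another person.",
        "It is appropriate to cite this when warning about mixing cleaning products.",
    ],
}

def to_training_document(rec: dict) -> str:
    # Flatten the record into a single pretraining document, so the usage
    # guidance is learned alongside the raw knowledge (illustrative format).
    guidance = "\n".join(f"- {rule}" for rule in rec["instruction_set"])
    return f"[USAGE GUIDANCE]\n{guidance}\n[CONTENT]\n{rec['text']}"

print(to_training_document(record))
```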
No, the tags are from a related alignment technique I’m hopeful about.
I am actually open to the tags idea. If someone can demonstrate it from the pretraining stage, creating at least a 7B model, that would be awesome, just to see how it works.
Check out the paper I linked to in my original comment.
Maybe I'm missing something, but based on the architecture they used, it's not what I am envisioning as a great experiment: the tests they did focused only on the 124-million-parameter GPT2-small. So this is different from what I am describing, which is a test on at least a 7B model.

As mentioned earlier, I am OK with all sorts of different experimental builds; I am just speculating about what a better experimental build could be, given a magic wand or enough resources. A 7-billion-parameter model (at the minimum) would be a great model to test, especially since we also need to test for generalizability after the toxicity evals.[1]

But I think we both agree that, until someone can provide a sound argument for why eliminating all bad/harmful/destructive data is a way for the AI to defend itself from jailbreaks/attacks/manipulation, pretraining on a combination of safe and harmful data is still the ideal setup.[2][3]
[1] The AI should still be "useful", or should still generalize, after the alignment pretraining or fine-tuning method is performed.

[2] Additionally, an experiment on whether we can capture both perspectives, safe and harmful data, via conditional training, tags, or instructional tags.

[3] Or, after pretraining, manage the good and bad data via fine-tuning (e.g. Reinforcement Learning from Framework Continuums; note: I wrote this).