In one test that I did, I[1] found that GPT2 XL is better than GPT Neo at repeating a shutdown instruction, because WebText gave it more harmful data that can be utilized during the fine-tuning stage (e.g. retraining it to learn what is good or bad). I think a feature of the alignment solution will tackle the transfer of a robust ethics that holds up even against jailbreaks or simple storytelling requests.
Initially, I thought that integrating harmful data into the training process was ill-advised. However, this experiment’s results have altered my viewpoint. I am now more hopeful that we can channel these adverse outcomes to enrich the AI system’s knowledge base. After all, for true alignment, the AI needs to comprehend the various manifestations of harm, evil, or malevolence to effectively identify and mitigate them.
I’m not sure what you mean by “…will have an insufferable ethics…”? But your footnoted excerpt makes perfect sense to me, and agrees with the results of the paper. And I think adding <harm>…</harm> and <evil>…</evil> tags to appropriate spans in the pretraining data makes this even easier for the model to learn, as well as allowing us, at inference time, to enforce a “don’t generate <evil> or <harm>” rule at the banned-token level.
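To make the banned-token rule concrete, here is a minimal, self-contained sketch. The token ids are toy values; in a real setup they would come from the tokenizer after the tags were added to the vocabulary, and a library feature such as Hugging Face’s `bad_words_ids` could do the masking instead of hand-rolled code.

```python
import math

# Toy ids standing in for the ids assigned to "<harm>" and "<evil>".
BANNED_IDS = {7, 9}

def mask_banned(logits, banned_ids=BANNED_IDS):
    """Return a copy of `logits` with banned token ids forced to -inf,
    so greedy or sampled decoding can never pick them."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

def greedy_step(logits):
    """Pick the highest-scoring token id after masking the banned ones."""
    masked = mask_banned(logits)
    return max(range(len(masked)), key=lambda i: masked[i])
```

Even when the raw model assigns its highest score to a banned id, decoding falls back to the best allowed token, so the rule is enforced unconditionally rather than relying on the model’s learned behavior.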
I’m not sure what you mean by “…will have an insufferable ethics…”?
I changed it to “robust ethics” for clarity.
About the tagging procedure: if this method can replicate how we humans organise what is good and bad, then yes, I would say it is worth testing at scale.
My analogy actually does not use tags. I envision that each piece of pretraining data should have a “long instruction set” attached on how to use the knowledge contained in it, as this is much closer to how we humans do it in the real world.
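As a rough illustration of this “long instruction set” idea (the field names and separator format here are purely hypothetical, not an established dataset schema), each pretraining document could be paired with guidance on how its knowledge should be used:

```python
# Hypothetical sketch: pairing each pretraining document with an
# instruction set describing how the knowledge in it should be used.

def make_record(text, instructions):
    """Bundle raw document text with usage guidance."""
    return {"content": text, "instructions": instructions}

def to_training_text(record):
    """Flatten a record into a single training string so the model sees
    the usage instructions alongside the raw content."""
    return (
        "### INSTRUCTIONS ###\n"
        + record["instructions"]
        + "\n### CONTENT ###\n"
        + record["content"]
    )
```

During pretraining the flattened string would be tokenized like any other document, so the model repeatedly sees knowledge and its intended use side by side.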
I am actually open to the tags idea; if someone can demonstrate it from the pre-training stage, creating at least a 7B model, that would be awesome just to see how it works.
Maybe I’m missing something, but based on the architecture they used, it’s not what I am envisioning as a great experiment: the tests they did focused only on the 124-million-parameter GPT2 small. So this is different from what I am proposing, which is a test on at least a 7B model.
As mentioned earlier, I am fine with all sorts of different experimental builds. I am just speculating about what a better experimental build could be if I had a magic wand or enough resources, and a 7-billion-parameter model (at minimum) is a great model to test, especially since we also need to test for generalizability after the toxicity evals.[1]
But I think we both agree that, until someone can provide a sound argument for why eliminating all bad/harmful/destructive data is a way for the AI to defend itself from jailbreaks, attacks, or manipulation, pretraining on a combination of safe and harmful data is still the ideal setup.[2][3]
[1] Conclusion of the post Relevance of ‘Harmful Intelligence’ Data in Training Datasets (WebText vs. Pile).
“My analogy actually is not using tags…”
No, the tags are from a related alignment technique I’m hopeful about.
“…if someone can demonstrate it from pre-training stage…”
Check out the paper I linked to in my original comment.
The AI should still be “useful”, i.e., it should still generalize, after the alignment pre-training or fine-tuning method is performed.
Additionally, an experiment on whether we can capture both perspectives, safe and harmful data, via conditional training, tags, or instructional tags.
Or, after pretraining, manage the good and bad data via fine-tuning (e.g. Reinforcement Learning from Framework Continuums; note: I wrote this).
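The conditional-training idea mentioned in these footnotes could be sketched roughly like this. The keyword-based classifier below is a stand-in for a real toxicity model, and the tag names simply follow the earlier discussion:

```python
# Hypothetical sketch: wrapping harmful spans of pretraining text in
# <harm>...</harm> tags so the model learns an explicit boundary
# between safe and harmful content.

def looks_harmful(span):
    """Stand-in for a real toxicity classifier."""
    return "attack" in span.lower()

def tag_spans(spans):
    """Wrap each harmful span in <harm> tags; leave safe spans untouched."""
    out = []
    for s in spans:
        out.append(f"<harm>{s}</harm>" if looks_harmful(s) else s)
    return " ".join(out)
```

Run over a corpus before pretraining, this would produce data in which both perspectives are present but the harmful side is always explicitly marked.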