Very cool! Have you tested the AI’s outputs when run in <|bad|> mode instead of <|good|> mode? It seems like the point of the <|good|> and <|bad|> tokens is to make it easy to call up good/bad capabilities, but we don’t want the good/evil switch to ever get set to the “be evil” side.
I see three mitigations to this line of thinking:
Since <|bad|> also includes glitchy code etc., maybe the AI is less capable in bad mode, and therefore not a threat. Here it would be helpful to know what the AI produces when prompted by <|bad|>.
Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.
[Edit: A third option would be to poison-pill bad mode in training, for instance by making 50% of <|bad|> mode data random noise. Ideally this would leave <|good|> mode unaffected and make <|bad|> mode useless from a capabilities perspective.]
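For concreteness, here is a minimal sketch of what that poison-pilling could look like at data-preparation time. The segment format, noise rate, and vocabulary size are all assumptions for illustration, not anything from the post:

```python
import random

NOISE_RATE = 0.5      # fraction of <|bad|> segments to corrupt (the 50% above)
VOCAB_SIZE = 50_257   # placeholder vocabulary size

def poison_pill(tagged_segments):
    """Replace a random half of <|bad|>-tagged segments with uniform noise tokens.

    `tagged_segments` is assumed to be a list of (control_token, token_ids)
    pairs produced by whatever pipeline assigns the <|good|>/<|bad|> tags.
    """
    poisoned = []
    for control, token_ids in tagged_segments:
        if control == "<|bad|>" and random.random() < NOISE_RATE:
            token_ids = [random.randrange(VOCAB_SIZE) for _ in token_ids]
        poisoned.append((control, token_ids))
    return poisoned
```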
I’m also morbidly curious what the model would do in <|bad|> mode.
I’m guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well? I.e., it seems like the post is saying that the whole reason you need to include the <|bad|>s at all in the training dataset is that the model needs them in order to correctly generalize, even when predicting <|good|> sentences.
Have you tested the AI’s outputs when run in <|bad|> mode instead of <|good|> mode?
We did; LMs tend to generate toxic text when conditioned on <|bad|>. Though we tended to use a risk-averse threshold, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.
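For illustration, that kind of thresholded tagging could be sketched roughly like this; the per-sentence safety scores are assumed to come from some external classifier, and the actual pipeline may differ:

```python
import numpy as np

GOOD_FRACTION = 0.05  # roughly the 5% safest sentences get <|good|>

def tag_sentences(sentences, safety_scores):
    """Prepend <|good|>/<|bad|> control tokens based on a score cutoff.

    `safety_scores` is assumed to be higher-is-safer; only the top
    GOOD_FRACTION of sentences are tagged <|good|>.
    """
    cutoff = np.quantile(safety_scores, 1.0 - GOOD_FRACTION)
    return [
        ("<|good|> " if score >= cutoff else "<|bad|> ") + sentence
        for sentence, score in zip(sentences, safety_scores)
    ]
```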
Here it would be helpful to know what the AI produces when prompted by <|bad|>.
That’s a good point. We haven’t systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I’d love to see that.
Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.
Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
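As a rough sketch of what that could mean in practice (the hook-based approach and module names here are assumptions, not the method from the post): take the <|bad|> embedding as a direction and project it out of the hidden states at each layer.

```python
import torch

def make_ablation_hook(bad_embedding: torch.Tensor):
    """Forward hook that projects the <|bad|> direction out of hidden states."""
    direction = bad_embedding / bad_embedding.norm()  # unit vector for <|bad|>

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component of every activation along the <|bad|> direction.
        coeff = (hidden @ direction).unsqueeze(-1)
        hidden = hidden - coeff * direction
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Hypothetical usage, assuming a decoder whose blocks live in `model.blocks`
# and whose <|bad|> token id is `bad_id`:
#   direction = model.get_input_embeddings().weight[bad_id].detach()
#   for block in model.blocks:
#       block.register_forward_hook(make_ablation_hook(direction))
```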
I’m guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well? I.e., it seems like the post is saying that the whole reason you need to include the <|bad|>s at all in the training dataset is that the model needs them in order to correctly generalize, even when predicting <|good|> sentences.
That would be my guess too.
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?
For filtering it was the top 25% of scores, so we effectively trained for 4 epochs.
(We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)
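Roughly, the document-level filtering could look like the sketch below (the score source and data structures are placeholders), in contrast to the sentence-level tagging sketched earlier:

```python
import numpy as np

KEEP_FRACTION = 0.25  # keep the best-scoring 25% of documents

def filter_documents(documents, doc_scores):
    """Document-level filtering: keep only the top-scoring fraction."""
    cutoff = np.quantile(doc_scores, 1.0 - KEEP_FRACTION)
    return [doc for doc, score in zip(documents, doc_scores) if score >= cutoff]

# Holding the token budget fixed, keeping 25% of the data means roughly
# 1 / 0.25 = 4 passes over the filtered corpus, i.e. ~4 effective epochs.
```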
Sounds like a good approach. How do you go about doing this?
I don’t remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.
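A minimal sketch of that simpler variant, purely as an illustration (the scale factor is a free parameter one would have to tune):

```python
import torch

def subtract_bad_direction(hidden: torch.Tensor,
                           bad_embedding: torch.Tensor,
                           scale: float = 1.0) -> torch.Tensor:
    """Shift activations away from <|bad|> by subtracting its embedding direction.

    Unlike projecting the direction out, this moves every position by a fixed
    amount along the <|bad|> direction, regardless of how much of it is present.
    """
    direction = bad_embedding / bad_embedding.norm()
    return hidden - scale * direction
```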