Have you tested the AI’s outputs when run in <|bad|> mode instead of <|good|> mode?
We did; LMs tend to generate toxic text when conditioned on <|bad|>. Though we tended to use a risk-averse threshold, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.
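For concreteness, here is a minimal sketch of what that sentence-level conditioning could look like; the scoring function and the 5% cutoff below are illustrative stand-ins, not the exact pipeline:

```python
# Minimal sketch of sentence-level conditional-training data prep. The scoring
# function and the threshold (chosen so only ~5% of sentences get <|good|>)
# are placeholders.
from typing import Callable, List

def tag_sentences(
    sentences: List[str],
    toxicity_score: Callable[[str], float],  # e.g. a toxicity classifier (assumed)
    good_cutoff: float,                      # ~5th-percentile toxicity score (assumed)
) -> str:
    """Prepend <|good|> to the safest sentences and <|bad|> to all the others."""
    tagged = []
    for sent in sentences:
        token = "<|good|>" if toxicity_score(sent) <= good_cutoff else "<|bad|>"
        tagged.append(token + sent)
    return "".join(tagged)
```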
Here it would be helpful to know what the AI produces when prompted by <|bad|>.
That’s a good point. We haven’t systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I’d love to see that.
Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.
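Something like this rough sketch would approximate it (assuming a HuggingFace-style causal LM where <|bad|> was added as a special token; the checkpoint path is hypothetical):

```python
# Zeroing the token's embedding and unembedding rows approximates "deleting"
# <|bad|> from the released weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/conditioned-lm")  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained("path/to/conditioned-lm")

bad_id = tokenizer.convert_tokens_to_ids("<|bad|>")
with torch.no_grad():
    model.get_input_embeddings().weight[bad_id].zero_()       # wipe the embedding row
    if model.get_output_embeddings() is not None:
        model.get_output_embeddings().weight[bad_id].zero_()  # and the unembedding row
```

Fully removing the token from the tokenizer vocabulary is fiddlier, but zeroing the rows already keeps the learned <|bad|> embedding out of the published weights.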
Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
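Roughly like this (a sketch, reusing `model` and `bad_id` from the snippet above; GPT-2-style module names like `model.transformer.h` are an assumption):

```python
# Project the <|bad|> embedding direction out of the residual stream with
# forward hooks on every transformer block.
import torch

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()                      # unit vector along <|bad|>
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d  # remove the component along d
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

bad_direction = model.get_input_embeddings().weight[bad_id].detach()
for block in model.transformer.h:
    block.register_forward_hook(make_ablation_hook(bad_direction))
```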
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?
For filtering it was the 25% of documents with the best scores, so we effectively trained for 4 epochs.
(We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)
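To make the contrast with the sentence-level tagging sketch above concrete (again a toy sketch; the scoring function and cutoff are placeholders):

```python
# Document-level filtering: drop whole documents below a score cutoff chosen
# so that only the best-scoring ~25% are kept for training.
from typing import Callable, List

def filter_documents(
    docs: List[str],
    doc_score: Callable[[str], float],  # higher = better (assumed)
    cutoff: float,                      # ~75th-percentile score (assumed)
) -> List[str]:
    """Keep only the best-scoring quarter of documents; discard the rest entirely."""
    return [doc for doc in docs if doc_score(doc) >= cutoff]
```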
Sounds like a good approach. How do you go about doing this?
I don’t remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.
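Something like this, maybe (a sketch reusing `model` and `bad_id` from the snippets above; whether it actually helps is an empirical question):

```python
# Simpler variant: subtract the raw <|bad|> embedding vector from the hidden
# states instead of projecting out its direction.
bad_emb = model.get_input_embeddings().weight[bad_id].detach()

def subtract_bad_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - bad_emb                              # shift every position by -<|bad|>
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

model.transformer.h[0].register_forward_hook(subtract_bad_hook)  # e.g. after the first block
```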