Have you tested the AI’s outputs when run in <|bad|> mode instead of <|good|> mode?
We did; LMs tend to generate toxic text when conditioned on <|bad|>. Though we tended to use a risk-averse threshold, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.
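For concreteness, here is a minimal sketch of what that sentence-level conditioning could look like; the scoring function and the 5% cutoff below are illustrative stand-ins, not the exact pipeline:

```python
# Minimal sketch of sentence-level conditional-training data prep. The scoring
# function and the threshold (chosen so only ~5% of sentences get <|good|>)
# are placeholders.
from typing import Callable, List

def tag_sentences(
    sentences: List[str],
    toxicity_score: Callable[[str], float],  # e.g. a toxicity classifier (assumed)
    good_cutoff: float,                      # ~5th-percentile toxicity score (assumed)
) -> str:
    """Prepend <|good|> to the safest sentences and <|bad|> to all the others."""
    tagged = []
    for sent in sentences:
        token = "<|good|>" if toxicity_score(sent) <= good_cutoff else "<|bad|>"
        tagged.append(token + sent)
    return "".join(tagged)
```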
Here it would be helpful to know what the AI produces when prompted by <|bad|>.
That’s a good point. We haven’t systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I’d love to see that.
Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.
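Something like this rough sketch would approximate it (assuming a HuggingFace-style causal LM where <|bad|> was added as a special token; the checkpoint path is hypothetical):

```python
# Zeroing the token's embedding and unembedding rows approximates "deleting"
# <|bad|> from the released weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/conditioned-lm")  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained("path/to/conditioned-lm")

bad_id = tokenizer.convert_tokens_to_ids("<|bad|>")
with torch.no_grad():
    model.get_input_embeddings().weight[bad_id].zero_()       # wipe the embedding row
    if model.get_output_embeddings() is not None:
        model.get_output_embeddings().weight[bad_id].zero_()  # and the unembedding row
```

Fully removing the token from the tokenizer vocabulary is fiddlier, but zeroing the rows already keeps the learned <|bad|> embedding out of the published weights.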
Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
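Roughly like this (a sketch, reusing `model` and `bad_id` from the snippet above; GPT-2-style module names like `model.transformer.h` are an assumption):

```python
# Project the <|bad|> embedding direction out of the residual stream with
# forward hooks on every transformer block.
import torch

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()                      # unit vector along <|bad|>
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d  # remove the component along d
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

bad_direction = model.get_input_embeddings().weight[bad_id].detach()
for block in model.transformer.h:
    block.register_forward_hook(make_ablation_hook(bad_direction))
```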
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?
For filtering it was the 25% of documents with the best scores, so we effectively trained for 4 epochs.
(We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)
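To make the contrast with the sentence-level tagging sketch above concrete (again a toy sketch; the scoring function and cutoff are placeholders):

```python
# Document-level filtering: drop whole documents below a score cutoff chosen
# so that only the best-scoring ~25% are kept for training.
from typing import Callable, List

def filter_documents(
    docs: List[str],
    doc_score: Callable[[str], float],  # higher = better (assumed)
    cutoff: float,                      # ~75th-percentile score (assumed)
) -> List[str]:
    """Keep only the best-scoring quarter of documents; discard the rest entirely."""
    return [doc for doc in docs if doc_score(doc) >= cutoff]
```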
Sounds like a good approach. How do you go about doing this?
I don’t remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.
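Something like this, maybe (a sketch reusing `model` and `bad_id` from the snippets above; whether it actually helps is an empirical question):

```python
# Simpler variant: subtract the raw <|bad|> embedding vector from the hidden
# states instead of projecting out its direction.
bad_emb = model.get_input_embeddings().weight[bad_id].detach()

def subtract_bad_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - bad_emb                              # shift every position by -<|bad|>
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

model.transformer.h[0].register_forward_hook(subtract_bad_hook)  # e.g. after the first block
```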