Very cool! Have you tested the AI’s outputs when run in <|bad|> mode instead of <|good|> mode? It seems like the point of the <|good|> and <|bad|> tokens is to make it easy to call up good/bad capabilities, but we don’t want the good/evil switch to ever get set to the “be evil” side.
I see three mitigations to this line of thinking:
Since <|bad|> also includes glitchy code etc., maybe the AI is less capable in bad mode, and therefore not a threat. Here it would be helpful to know what the AI produces when prompted by <|bad|>.
Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.
[Edit: A third option would be to poison-pill bad mode in training, for instance by making 50% of <|bad|> mode data random noise. Ideally this would leave <|good|> mode unaffected and make <|bad|> mode useless from a capabilities perspective.]
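For concreteness, here is a minimal sketch of what that poison-pilling could look like at data-preparation time. The segment format, noise rate, and vocabulary size are all assumptions for illustration, not anything from the post:

```python
import random

NOISE_RATE = 0.5      # fraction of <|bad|> segments to corrupt (the 50% above)
VOCAB_SIZE = 50_257   # placeholder vocabulary size

def poison_pill(tagged_segments):
    """Replace a random half of <|bad|>-tagged segments with uniform noise tokens.

    `tagged_segments` is assumed to be a list of (control_token, token_ids)
    pairs produced by whatever pipeline assigns the <|good|>/<|bad|> tags.
    """
    poisoned = []
    for control, token_ids in tagged_segments:
        if control == "<|bad|>" and random.random() < NOISE_RATE:
            token_ids = [random.randrange(VOCAB_SIZE) for _ in token_ids]
        poisoned.append((control, token_ids))
    return poisoned
```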
I’m also morbidly curious what the model would do in <|bad|> mode.
I’m guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well? I.e., it seems like the post is saying that the whole reason you need to include the <|bad|>s at all in the training dataset is that the model needs them in order to correctly generalize, even when predicting <|good|> sentences.
Have you tested the AI’s outputs when run in <|bad|> mode instead of <|good|> mode?
We did; LMs tend to generate toxic text when conditioned on <|bad|>. Though we tended to use a risk-averse threshold, i.e. we used <|good|> for only about the 5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.
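For illustration, that kind of thresholded tagging could be sketched roughly like this; the per-sentence safety scores are assumed to come from some external classifier, and the actual pipeline may differ:

```python
import numpy as np

GOOD_FRACTION = 0.05  # roughly the 5% safest sentences get <|good|>

def tag_sentences(sentences, safety_scores):
    """Prepend <|good|>/<|bad|> control tokens based on a score cutoff.

    `safety_scores` is assumed to be higher-is-safer; only the top
    GOOD_FRACTION of sentences are tagged <|good|>.
    """
    cutoff = np.quantile(safety_scores, 1.0 - GOOD_FRACTION)
    return [
        ("<|good|> " if score >= cutoff else "<|bad|> ") + sentence
        for sentence, score in zip(sentences, safety_scores)
    ]
```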
Here it would be helpful to know what the AI produces when prompted by <|bad|>.
That’s a good point. We haven’t systematically investigated differences in capabilities between <|good|> and <|bad|> modes; I’d love to see that.
Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters, so switching to evil mode would require rediscovering that token embedding.
Yeah, you could even block the entire direction in activation space corresponding to the embedding of the <|bad|> token.
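As a rough sketch of what that could mean in practice (the hook-based approach and module names here are assumptions, not the method from the post): take the <|bad|> embedding as a direction and project it out of the hidden states at each layer.

```python
import torch

def make_ablation_hook(bad_embedding: torch.Tensor):
    """Forward hook that projects the <|bad|> direction out of hidden states."""
    direction = bad_embedding / bad_embedding.norm()  # unit vector for <|bad|>

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove the component of every activation along the <|bad|> direction.
        coeff = (hidden @ direction).unsqueeze(-1)
        hidden = hidden - coeff * direction
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Hypothetical usage, assuming a decoder whose blocks live in `model.blocks`
# and whose <|bad|> token id is `bad_id`:
#   direction = model.get_input_embeddings().weight[bad_id].detach()
#   for block in model.blocks:
#       block.register_forward_hook(make_ablation_hook(direction))
```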
I’m guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well? I.e., it seems like the post is saying that the whole reason you need to include the <|bad|>s at all in the training dataset is that the model needs them in order to correctly generalize, even when predicting <|good|> sentences.
That would be my guess too.
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?
For filtering it was the top 25% of scores, so we effectively trained for 4 epochs.
(We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)
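Roughly, the document-level filtering could look like the sketch below (the score source and data structures are placeholders), in contrast to the sentence-level tagging sketched earlier:

```python
import numpy as np

KEEP_FRACTION = 0.25  # keep the best-scoring 25% of documents

def filter_documents(documents, doc_scores):
    """Document-level filtering: keep only the top-scoring fraction."""
    cutoff = np.quantile(doc_scores, 1.0 - KEEP_FRACTION)
    return [doc for doc, score in zip(documents, doc_scores) if score >= cutoff]

# Holding the token budget fixed, keeping 25% of the data means roughly
# 1 / 0.25 = 4 passes over the filtered corpus, i.e. ~4 effective epochs.
```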
Sounds like a good approach. How do you go about doing this?
I don’t remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.
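A minimal sketch of that simpler variant, purely as an illustration (the scale factor is a free parameter one would have to tune):

```python
import torch

def subtract_bad_direction(hidden: torch.Tensor,
                           bad_embedding: torch.Tensor,
                           scale: float = 1.0) -> torch.Tensor:
    """Shift activations away from <|bad|> by subtracting its embedding direction.

    Unlike projecting the direction out, this moves every position by a fixed
    amount along the <|bad|> direction, regardless of how much of it is present.
    """
    direction = bad_embedding / bad_embedding.norm()
    return hidden - scale * direction
```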