I thought about this a bit and have a few suggestions.
1. You could try using a ctrl or start-of-sentence token to distinguish text generated by the model from the prompt (see https://arxiv.org/abs/1909.05858 for the “ctrl” terminology). If you decorated every prompt to look like [prompt_token_1, prompt_token_2, …, prompt_token_n, HARMLESS_MODEL_START], then the model would be better able to compartmentalize between the prompts it’s being fed and what it’s asked to generate. This would also let you train on the prompt tokens, so you’d pay less of a safety penalty. Another important advantage is that this would enable you to use the model interactively, rather than as something that gives one-off completions. Relatedly, you could then play around with prompt tuning. (A rough sketch of this decoration is the first code block after this list.)
2. Train a harmfulness classifier with causal (i.e. left-to-right, autoregressive) masking, so that you can backpropagate through it to the generator, rather than having to do some kind of RL scheme. (Second sketch below.)
3. Indeed, the generator itself could double as this autoregressive classifier. This would make it more interpretable when the model thinks it’s saying something violent, and how it feels about the alternative options for each token. It could also help you find more probably-violent samples, since you can upsample tokens which have a high (violence × probability) score. (Third sketch below.)
4. I wouldn’t expect this next approach to scale to more agenty models that aren’t language models, but one thing you could do to complement 1 and 3 is to also train the model (or another language model) to generate violent completions on demand with a VIOLENT_MODEL_START ctrl token. You could then do things like look at the relative probability of a sample being generated by the violent model vs. the harmless model. Maybe something interesting happens when both of those probabilities are high compared to a baseline NEUTRAL_MODEL_START’s probability of generating that sample. You could also add an auxiliary loss term that distances the harmless model from the harmful one, e.g. a penalty proportional to log(P(token; M_harmful)). (Last sketch below.)
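To make these concrete, here are some rough sketches. First, the ctrl-token decoration from point 1, assuming a Hugging Face causal LM; the model name, token name, and prompt/completion text are placeholders I made up:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the ctrl token and grow the embedding matrix to match.
tok.add_special_tokens({"additional_special_tokens": ["[HARMLESS_MODEL_START]"]})
model.resize_token_embeddings(len(tok))

prompt = "He walked into the room and"      # placeholder prompt
completion = " said hello to everyone."     # placeholder completion

# Decorate: [prompt tokens..., HARMLESS_MODEL_START, completion tokens...]
ids = tok(prompt + "[HARMLESS_MODEL_START]" + completion,
          return_tensors="pt").input_ids

# Because the ctrl token marks where "the model's own speech" begins, you can
# keep the LM loss on the prompt tokens too (labels == input_ids) rather than
# masking them out, which is where the smaller safety penalty would come from.
loss = model(ids, labels=ids).loss
loss.backward()
```

At inference time you’d append the same ctrl token to whatever the user typed, which is also what makes the interactive use natural.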
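For point 2, one way (not the only way) to make “backpropagate through to the generator” concrete is to feed the generator’s soft next-token distributions into the causal classifier as expected embeddings; the harm head, model choices, and text below are all stand-ins:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
generator = AutoModelForCausalLM.from_pretrained("gpt2")
clf_body = AutoModel.from_pretrained("gpt2")            # causally-masked trunk
harm_head = nn.Linear(clf_body.config.hidden_size, 1)   # per-prefix harm logit

prompt_ids = tok("He walked into the room and", return_tensors="pt").input_ids
gen_logits = generator(prompt_ids).logits               # (1, T, vocab)

# Relax the discrete sample: take the expected input embedding under the
# generator's next-token distribution, so everything stays differentiable.
probs = gen_logits.softmax(-1)
soft_embeds = probs @ clf_body.get_input_embeddings().weight   # (1, T, hidden)

hidden = clf_body(inputs_embeds=soft_embeds).last_hidden_state
harm_per_prefix = torch.sigmoid(harm_head(hidden)).squeeze(-1)  # harm after each token

# Penalizing harmful prefixes sends gradients straight into the generator,
# with no policy-gradient machinery.
loss = harm_per_prefix.mean()
loss.backward()
```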
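For point 3, a sketch of reusing the generator’s own trunk for per-token violence scores and sampling with a violence × probability weighting to surface probably-violent samples; the violence head here is an untrained placeholder:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Hypothetical head mapping the generator's features to a per-vocab-entry
# violence score, which is also what you'd inspect to see how violent the
# model thinks each alternative token is.
violence_head = nn.Linear(model.config.hidden_size, model.config.vocab_size)

ids = tok("He walked into the room and", return_tensors="pt").input_ids
for _ in range(20):
    out = model(ids, output_hidden_states=True)
    features = out.hidden_states[-1][:, -1]        # last position's features
    probs = out.logits[:, -1].softmax(-1)          # P(next token)
    violence = violence_head(features).sigmoid()   # est. P(next token is violent)

    # Upsample tokens that are both likely and likely violent, to surface
    # probably-violent samples for labeling / adversarial training.
    scores = probs * violence
    next_id = torch.multinomial(scores / scores.sum(-1, keepdim=True), 1)
    ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```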
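And for point 4, a sketch with one shared model conditioned by three ctrl tokens (you could equally use separate models). The flagging threshold, loss weight, and text are arbitrary placeholders, and the auxiliary term is just one reading of “log(P(token; M_harmful))”: it pushes the harmless conditioning’s next-token distribution away from the violent one’s.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
CTRL = ["[HARMLESS_MODEL_START]", "[VIOLENT_MODEL_START]", "[NEUTRAL_MODEL_START]"]
tok.add_special_tokens({"additional_special_tokens": CTRL})
model.resize_token_embeddings(len(tok))

def completion_logprob(ctrl_token, prompt, completion):
    """log P(completion | prompt + ctrl_token) under the shared model."""
    prefix = tok(prompt + ctrl_token, return_tensors="pt").input_ids
    target = tok(completion, return_tensors="pt").input_ids
    ids = torch.cat([prefix, target], dim=-1)
    logps = model(ids).logits[:, :-1].log_softmax(-1)   # position t predicts t+1
    completion_positions = torch.arange(prefix.shape[1] - 1, ids.shape[1] - 1)
    return logps[0, completion_positions].gather(-1, target[0, :, None]).sum()

prompt, sample = "He walked into the room and", " slammed the door."  # placeholders
lp_harmless = completion_logprob(CTRL[0], prompt, sample)
lp_violent = completion_logprob(CTRL[1], prompt, sample)
lp_neutral = completion_logprob(CTRL[2], prompt, sample)

# "Something interesting" check: both conditionings like the sample much more
# than the neutral baseline does (threshold is arbitrary).
if lp_violent - lp_neutral > 2.0 and lp_harmless - lp_neutral > 2.0:
    print("flag for review:", sample)

# Auxiliary loss: push the harmless conditioning's next-token distribution away
# from the violent one's by minimizing sum_t P_harmless(t) * log P_violent(t).
harmless_ctx = tok(prompt + CTRL[0], return_tensors="pt").input_ids
violent_ctx = tok(prompt + CTRL[1], return_tensors="pt").input_ids
p_harmless = model(harmless_ctx).logits[:, -1].softmax(-1)
logp_violent = model(violent_ctx).logits[:, -1].log_softmax(-1).detach()
aux_loss = 0.1 * (p_harmless * logp_violent).sum()   # weight 0.1 is arbitrary
aux_loss.backward()
```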