Density models have been tried as a defense against adversarial attacks in many domains, including vision and NLP. Unfortunately, this rarely seems to work, because you can often find adversarial examples that fool both the density model and the classifier (e.g., it was pretty easy to do this for both Redwood’s injury classifier and the fine-tuned GPT-Neo we used for generating text).
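For concreteness, the kind of setup I mean looks roughly like this: reject any input the density model assigns low likelihood, and only classify the rest. This is just a minimal sketch, not our actual pipeline; the model name, threshold, and function names are placeholders.

```python
# Minimal sketch of a density-model filter in front of a classifier.
# Illustrative only: model name, threshold, and helper names are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
density_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
density_model.eval()

def mean_token_logprob(text: str) -> float:
    """Average log-probability per token under the density model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = density_model(ids, labels=ids)
    # With labels=ids, the returned loss is the mean cross-entropy per token,
    # so its negation is the mean per-token log-probability.
    return -out.loss.item()

def filtered_classify(text: str, classifier, threshold: float = -6.0):
    """Reject inputs the density model finds too unlikely; otherwise classify.

    The failure mode discussed above: an attacker searches for text that is
    simultaneously high-likelihood under the density model (so it passes this
    filter) and misclassified by the downstream classifier.
    """
    if mean_token_logprob(text) < threshold:
        return "rejected: low likelihood under density model"
    return classifier(text)
```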
Interesting, thanks. That makes me curious about the adversarial text examples that trick the density model: do they look intuitively ‘natural’ to us as humans?
No! That’s why they’re clearly adversarial, as opposed to things that the density model gets right.
Thanks. (The alternative I was thinking of is that the prompt might look okay but cause the model to output a continuation that’s surprising and undesirable.)