Interesting, thanks. That makes me curious: about the adversarial text examples that trick the density model, do they look intuitively ‘natural’ to us as humans?
No! That’s why they’re clearly adversarial, as opposed to things that the density model gets right.
Thanks. (The alternative I was thinking of is that the prompt might look okay but cause the model to output a continuation that’s surprising and undesirable.)