Density models have been tried as a defense against adversarial attacks in many domains, including vision and NLP. Unfortunately, this rarely seems to work, because you can often find adversarial examples that fool both the density model and the classifier (e.g., it was pretty easy to do this for both Redwood’s injury classifier and the fine-tuned GPT-Neo we used for generating text).
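For concreteness, the kind of setup I mean looks roughly like this: reject any input the density model assigns low likelihood, and only classify the rest. This is just a minimal sketch, not our actual pipeline; the model name, threshold, and function names are placeholders.

```python
# Minimal sketch of a density-model filter in front of a classifier.
# Illustrative only: model name, threshold, and helper names are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
density_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
density_model.eval()

def mean_token_logprob(text: str) -> float:
    """Average log-probability per token under the density model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = density_model(ids, labels=ids)
    # With labels=ids, the returned loss is the mean cross-entropy per token,
    # so its negation is the mean per-token log-probability.
    return -out.loss.item()

def filtered_classify(text: str, classifier, threshold: float = -6.0):
    """Reject inputs the density model finds too unlikely; otherwise classify.

    The failure mode discussed above: an attacker searches for text that is
    simultaneously high-likelihood under the density model (so it passes this
    filter) and misclassified by the downstream classifier.
    """
    if mean_token_logprob(text) < threshold:
        return "rejected: low likelihood under density model"
    return classifier(text)
```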
Interesting, thanks. That makes me curious about the adversarial text examples that trick the density model: do they look intuitively ‘natural’ to us as humans?
No! That’s why they’re clearly adversarial, as opposed to things that the density model gets right.
Thanks. (The alternative I was thinking of is that the prompt might look okay but cause the model to output a continuation that’s surprising and undesirable.)