hold_my_fish comments on SolidGoldMagikarp (plus, prompt generation)

hold_my_fish 6 Feb 2023 19:27 UTC
1 point
0
Regarding the prompt generation, I wonder whether anomalous prompts could be detected (and rejected if desired). After all, GPT can estimate a probability for any given text. That makes it different from typical image classifiers, which don’t model the input distribution.
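A minimal sketch of that detection idea, under loud assumptions: a toy character-bigram model stands in for GPT's probability estimate, and the class name, helper name, training corpus, and threshold are all hypothetical choices made for illustration. The only point is the shape of the filter: score a prompt by its average log-probability and reject it when the score falls below a cutoff.

```python
import math
from collections import Counter


class BigramDensityModel:
    """Toy character-bigram model with add-one smoothing.

    Hypothetical stand-in for a real language model's likelihood estimate.
    """

    def __init__(self, corpus):
        self.vocab = set(corpus)
        # Counts of adjacent character pairs and of each preceding character.
        self.pair_counts = Counter(zip(corpus, corpus[1:]))
        self.context_counts = Counter(corpus[:-1])

    def avg_log_prob(self, text):
        """Mean log-probability per character transition (higher = more typical)."""
        if len(text) < 2:
            return 0.0
        # One extra slot reserves probability mass for unseen characters.
        v = len(self.vocab) + 1
        total = 0.0
        for a, b in zip(text, text[1:]):
            numerator = self.pair_counts[(a, b)] + 1
            denominator = self.context_counts[a] + v
            total += math.log(numerator / denominator)
        return total / (len(text) - 1)


def is_anomalous(model, prompt, threshold):
    """Flag a prompt whose average log-probability is below the threshold."""
    return model.avg_log_prob(prompt) < threshold


# Assumed training text and cutoff; a real filter would tune both on held-out data.
corpus = (
    "the cat sat on the mat. the dog sat on the log. "
    "the cat and the dog ate the food. "
) * 20
model = BigramDensityModel(corpus)
THRESHOLD = -2.5

for prompt in ["the dog sat on the mat.", "SolidGoldMagikarp"]:
    verdict = "reject" if is_anomalous(model, prompt, THRESHOLD) else "accept"
    print(f"{prompt!r}: {verdict}")
```

Normalizing by length keeps long prompts from being penalized just for being long. With a real model the same structure would apply, with per-token log-probabilities replacing the bigram scores.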
- LawrenceC 9 Feb 2023 9:04 UTC
  4 points
  0
  Using density models has been tried for defending against adversarial attacks in many domains, including vision and NLP stuff. Unfortunately, it rarely seems to work, because you can often find adversarial examples for both the density model and the classifier (e.g. it was pretty easy to do this for both Redwood’s injury classifier and the fine-tuned GPT-Neo we used for generating text).
  - hold_my_fish 9 Feb 2023 20:37 UTC
    1 point
    0
    Interesting, thanks. That makes me curious: do the adversarial text examples that trick the density model look intuitively ‘natural’ to us as humans?
    - LawrenceC 9 Feb 2023 23:02 UTC
      3 points
      0
      No! That’s why they’re clearly adversarial, as opposed to things that the density model gets right.
      - hold_my_fish 9 Feb 2023 23:55 UTC
        1 point
        0
        Thanks. (The alternative I was thinking of is that the prompt might look okay but cause the model to output a continuation that’s surprising and undesirable.)