ChrisCundy comments on SolidGoldMagikarp (plus, prompt generation)

ChrisCundy 6 Feb 2023 18:28 UTC
2 points
0
Would you be able to elaborate a bit on your process for adversarially attacking the model?
It sounds like a combination of projected gradient descent and clustering? I took a look at the code but a brief mathematical explanation / algorithm sketch would help a lot!
A couple of colleagues and I are thinking about using this approach to demonstrate some robustness failures in LLMs, so it would be great to build on your work.
- Jessica Rumbelow 6 Feb 2023 21:18 UTC
  4 points
  0
  Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the negative log probability of the target output token(s), i.e. to make the target output as likely as possible. Happy to have a quick call to go through the code if you like, DM me :)
  - ChrisCundy 6 Feb 2023 22:22 UTC
    1 point
    0
    Thanks for the elaboration, I’ll follow up offline
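
For anyone reading along, here is a rough sketch of the loop Jessica describes above: gradient descent on an input embedding, re-projection to length 1 after each step, and a penalty on distance from the nearest legal token. This is not the authors' code. The four-token vocabulary, the toy model (logits proportional to the dot product with each token embedding), the hand-written gradient, and the names `TOKENS`, `SCALE`, `optimise_input` and `penalty` are all assumptions made so the example runs on its own. A real version would backpropagate through the language model instead.

```python
import math

# Toy stand-in for a language model (an assumption, not the authors' setup):
# four "legal" token embeddings on the unit circle, and a model whose logits
# are SCALE * <token embedding, input embedding>.
TOKENS = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
SCALE = 5.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Project a vector back onto the unit sphere (length 1)."""
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nearest_token(x):
    """Index of the legal token embedding closest to x (Euclidean distance)."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(x, TOKENS[i]))
    return min(range(len(TOKENS)), key=sq_dist)


def neg_log_prob(x, target):
    """Negative log probability the toy model assigns to the target token."""
    probs = softmax([SCALE * dot(t, x) for t in TOKENS])
    return -math.log(probs[target])


def optimise_input(target, init, steps=200, lr=0.1, penalty=0.1):
    """Gradient descent on the input embedding with a unit-length constraint."""
    x = normalise(init)
    for _ in range(steps):
        probs = softmax([SCALE * dot(t, x) for t in TOKENS])
        near = TOKENS[nearest_token(x)]
        grad = []
        for d in range(len(x)):
            # Gradient of -log p(target): SCALE * sum_i (p_i - 1[i == target]) * token_i
            g_nll = SCALE * sum(
                (p - (i == target)) * TOKENS[i][d] for i, p in enumerate(probs)
            )
            # Gradient of penalty * ||x - nearest||^2, treating the nearest token as fixed
            g_pen = 2.0 * penalty * (x[d] - near[d])
            grad.append(g_nll + g_pen)
        # Take a step, then re-project onto the unit sphere
        x = normalise(tuple(xi - lr * gi for xi, gi in zip(x, grad)))
    return x


# Initialise at the centroid of a two-token "cluster", then optimise toward token 1.
start = normalise((0.5, 0.5))
result = optimise_input(target=1, init=start)
```

In this toy setting the optimised embedding ends up next to the target token's own embedding, which is the uninteresting case. With a real model the point of the penalty term is to keep the optimised embedding near some legal token so it can be read back as an actual prompt.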