I wanted to test out the prompt generation part of this, so I made a version where you pick a particular input sequence and then only allow a certain fraction of the input tokens to change. I’ve been initialising it with a paragraph about COVID and testing how few tokens it needs to change before it can reliably force a particular output token.
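For concreteness, here’s a rough sketch of what such a constrained search could look like (this is not the code linked below: the actual prompt generation is gradient-based, whereas this just does greedy random token substitution over a fixed set of mutable positions; the names `attack`, `frac_mutable` etc. are made up for illustration, and it assumes GPT-2 via HuggingFace transformers):

```python
# Sketch: force a chosen next token while only allowing a fixed fraction of
# the original paragraph's tokens to change. Greedy random substitution,
# not the gradient-based method from the post.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()


def target_logprob(ids: torch.Tensor, target_id: int) -> float:
    """Log-probability of target_id as the next token after the sequence ids."""
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()


def attack(text: str, target_token: str, frac_mutable: float = 0.1,
           steps: int = 100, candidates_per_step: int = 64, seed: int = 0):
    torch.manual_seed(seed)
    ids = torch.tensor(tok.encode(text), device=device)
    target_id = tok.encode(target_token)[0]  # assumes the target is a single token

    # Fix which positions are allowed to change: only this fraction of the
    # original paragraph may ever differ from the starting text.
    n_mutable = max(1, int(frac_mutable * len(ids)))
    mutable_pos = torch.randperm(len(ids))[:n_mutable].tolist()

    best, best_score = ids.clone(), target_logprob(ids, target_id)
    for _ in range(steps):
        pos = mutable_pos[torch.randint(len(mutable_pos), (1,)).item()]
        for cand in torch.randint(len(tok), (candidates_per_step,)).tolist():
            trial = best.clone()
            trial[pos] = cand
            score = target_logprob(trial, target_id)
            if score > best_score:  # keep the swap only if it helps
                best, best_score = trial, score
    return tok.decode(best), best_score


adv_text, logp = attack("Typical people infected with the virus will experience "
                        "mild to moderate illness and recover without requiring "
                        "special treatment.", " 74")
print(adv_text)
print("P(' 74') ≈", float(torch.exp(torch.tensor(logp))))
```

A repetition penalty of the sort mentioned below could presumably be added by docking the score whenever a mutable position is set to the target token itself.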
Turns out it only needs a few tokens to fairly reliably force a single output, even within the context of a whole paragraph. E.g. “typical people infected Majesty the virus will experience mild to moderate 74 illness and recover without requiring special treatment. However, some will become seriously ill and require medical attention. Older people and those with underlying medical conditions like cardiovascular disease” has a >99.5% chance of ' 74' as the next token. Penalising repetition makes the task much harder.
It can even pretty reliably cause GPT-2 to output SolidGoldMagikarp with >99% probability by only changing 10% of the tokens, though it does this by just inserting SolidGoldMagikarp wherever possible. From playing around with it for an hour or so, it seems that if you penalise repeating the initial token, it never succeeds.
I don’t think these attacks are at all new (see Universal Adversarial Triggers from 2019 and others) but it’s certainly fun to test out.
This raises a number of questions:
How does this change when we scale up to GPT-3, and to ChatGPT? Is this still possible after the mode collapse that comes with lots of fine-tuning?
Can this be extended to eliciting whole new behaviours, rather than just single next tokens? What about behaviours that fine-tuning has discouraged?
Since this is a way to mechanically generate non-robust outputs, can it be fed back into training to make models more robust? Would sprinkling this kind of noise into the training data prevent adversarial examples?
code here