You may also want to check out Universal Adversarial Triggers (https://arxiv.org/abs/1908.07125), an academic paper from 2019 that does the same thing as the above: the authors craft the optimal worst-case prompt to feed into a model, then use that prompt to analyze GPT-2 and other models.
I just skimmed that paper, but I don’t think it finds tokens like “ SolidGoldMagikarp” that show the strange sort of behaviour described in this post. Am I missing something, or by “the exact same thing as the above” were you just referring to one particular section of the post?
Thanks—wasn’t aware of this!