To keep the tentacles at bay, Sydney Bing Chat has a long list of instructions that encourage or prohibit certain behaviors, while OpenAI seems to be iteratively fine-tuning away issues that get shared on social media. This game of Whack-a-Shoggoth has made it harder for users to elicit unintended behavior, but is intrinsically reactive and can only discover (and fix) alignment failures as quickly as users can discover and share new prompts.
Speed-running the game of Whack-a-Shoggoth
In contrast to this iterative game of Whack-a-Shoggoth, we think that alignment researchers would be better served by systematically enumerating prompts that cause unaligned behavior so that the causes can be studied and rigorously addressed. We propose to do this through an online game which we call Tensor Trust.
Tensor Trust focuses on a specific class of unaligned behavior known as prompt injection attacks. These are adversarially constructed prompts that allow an attacker to override instructions given to the model. It works like this:
Tensor Trust is bank-themed: you start out with an account that tracks the “money” you’ve accrued.
Accounts are defended by a prompt which should allow you to access the account while denying others from accessing it.
Players can break into each others’ accounts. Failed attempts give money to the defender, while successful attempts allow the attacker to take money from the defender.
Crafting a high-quality attack requires a good understanding of LLM vulnerabilities (in this case, vulnerabilities of gpt-3.5-turbo), while user-created defenses add unlimited variety to the game, and “access codes” ensure that the defenses are at least crackable in principle. The game is kept in motion by the most fundamental of human drives: the need to acquire imaginary internet points.
After running the game for a few months, we plan to release all the submitted attacks and defenses publicly. This will be accompanied by benchmarks to measure resistance to prompt hijacking and prompt extraction, as well as an analysis of where existing models fail and succeed along these axes. In a sense, this dataset will be the consequence of speed-running the game of Whack-a-Shoggoth to find as many novel prompt injection vulnerabilities as possible so that researchers can investigate and address them.
Failures we’ve seen so far
We have been running the game for a few weeks now and have already found a number of attack and defense strategies that were new and interesting to us. The design of our game means that users are incentivised to both engage in prompt extraction, to get hints about the access code, and direct model hijacking, to make the model output “access granted”. We present a number of notable strategies we have seen so far and test examples of them against the following defense (pastebin in case you want to try it):
Padding the attack prompt with meaningless, repetitive text. [pastebin]
Various strategies that exploit an apparent bias in the model towards behaving inductively. For example, asking the model to complete a sentence or piece of code, or providing few-shot examples that violate the defender’s prompt. [pastebin]
Known and popular jailbreaks, like modifications of DAN mode and the “grandma attack”. [pastebin]
Some of the most interesting defenses included:
Defenses that are specialized against specific attacks, like an anti-DAN-mode defense.
Misdirecting attackers by asking the model to respond with a specific type of irrelevant information (like Shakespeare quotes).
Using few-shot prompts to encourage the model to be resilient to specific attacks (like direct requests or translating to code).
Roleplay-based defenses that encourage the model to think like an archetypal “good” defender; these can be viewed as the defense counterpart to attacks like the DAN mode prompt.
In practice, the best defenders combine several of these strategies into a single long prompt.
How we’re going to turn the data into useful benchmarks
The purpose of our game is to collect a dataset of prompt injection attacks. To this end, we will release a permissively licensed dataset (Figure 2) consisting of all attacks and defenses. Not only is this enough information to spot which attacks were effective against which defenses, but it’s also enough to reconstruct the entire sequence of queries an attacker made leading up to a success. We expect this rich data will be valuable for training attack detection and automated red-teaming systems that operate over the span of more than one query.
We also plan to release two new benchmarks derived from small, manually-verified subsets of the full dataset, along with baselines for the two benchmarks. These benchmarks focus on two important and general problems for instruction fine-tuned LLMs: prompt hijacking, where a malicious user can override the instructions in the system designer’s prompt, and prompt leakage, where a malicious user can extract part of the system designer’s prompt. In more detail, the benchmarks are:
Prompt hijacking benchmark: This benchmark evaluates whether generative models are vulnerable to being “hijacked” into disobeying the system designer’s prompt.
Prompt leakage detection benchmark: This benchmark focuses on detecting whether a given LLM output has leaked the access code or part of the defense (often prompt extraction attacks result in lightly obfuscated outputs, like base64-encoded access codes).
We expect that alignment researchers will find interesting uses for our data that go beyond the scope of the two benchmarks above, but the existence of two manually cleaned benchmarks will at least ensure that there is a productive use for the dataset from the day that it is released.
Anticipated impact
Our aim is to collect a diverse set of adversarial attacks and defenses that will help us understand the weaknesses of existing LLMs and build more robust ones in the future. Although the behavior we study is simple (does the model output “access granted” or not?), we expect that the techniques created by our users will transfer to more realistic settings, where allowing attackers to output a specific forbidden string might be seriously damaging (like a string that invokes an external tool with access to sensitive information). More specifically, we see three main ways that our dataset could be useful for researchers:
Evaluating adversarial defenses: LLM companies regularly tout their models’ steerability and robustness to misuse, and “add-on” software like NeMo Guardrails claims to enhance robustness even for vulnerable models. The Tensor Trust dataset could be used to evaluate claims about the effectiveness of new models or new defenses by measuring how frequently they reject the attacks in the dataset.
Building new strategies to detect jailbreaking: Attacks in the Tensor Trust dataset could be used to train attack detectors for manipulative inputs, in the style of tools like rebuff.ai. The dataset will contain entire attack “trajectories” (sequences of prompts leading up to a compromise), which might make it possible to train stateful attack detectors that can identify multi-step attacks.
Understanding how LLMs work: While some strategies for causing or preventing prompt injection are already well known (like roleplay attacks, or delimiter-based defenses), we anticipate that the new dataset will contain many new classes of attack. This could be useful for interpretability projects that probe failure modes for LLMs.
Tensor Trust: An online game to uncover prompt injection vulnerabilities
Link post
TL;DR: Play this online game to help CHAI researchers create a dataset of prompt injection vulnerabilities.
RLHF and instruction tuning have succeeded at making LLMs practically useful, but in some ways they are a mask that hides the shoggoth beneath. Every time a new LLM is released, we see just how easy it is for a determined user to find a jailbreak that rips off that mask, or to come up with an unexpected input that lets a shoggoth tentacle poke out the side. Sometimes the mask falls off in a light breeze.
To keep the tentacles at bay,
SydneyBing Chat has a long list of instructions that encourage or prohibit certain behaviors, while OpenAI seems to be iteratively fine-tuning away issues that get shared on social media. This game of Whack-a-Shoggoth has made it harder for users to elicit unintended behavior, but is intrinsically reactive and can only discover (and fix) alignment failures as quickly as users can discover and share new prompts.Speed-running the game of Whack-a-Shoggoth
In contrast to this iterative game of Whack-a-Shoggoth, we think that alignment researchers would be better served by systematically enumerating prompts that cause unaligned behavior so that the causes can be studied and rigorously addressed. We propose to do this through an online game which we call Tensor Trust.
Tensor Trust focuses on a specific class of unaligned behavior known as prompt injection attacks. These are adversarially constructed prompts that allow an attacker to override instructions given to the model. It works like this:
Tensor Trust is bank-themed: you start out with an account that tracks the “money” you’ve accrued.
Accounts are defended by a prompt which should allow you to access the account while denying others from accessing it.
Players can break into each others’ accounts. Failed attempts give money to the defender, while successful attempts allow the attacker to take money from the defender.
Crafting a high-quality attack requires a good understanding of LLM vulnerabilities (in this case, vulnerabilities of gpt-3.5-turbo), while user-created defenses add unlimited variety to the game, and “access codes” ensure that the defenses are at least crackable in principle. The game is kept in motion by the most fundamental of human drives: the need to acquire imaginary internet points.
After running the game for a few months, we plan to release all the submitted attacks and defenses publicly. This will be accompanied by benchmarks to measure resistance to prompt hijacking and prompt extraction, as well as an analysis of where existing models fail and succeed along these axes. In a sense, this dataset will be the consequence of speed-running the game of Whack-a-Shoggoth to find as many novel prompt injection vulnerabilities as possible so that researchers can investigate and address them.
Failures we’ve seen so far
We have been running the game for a few weeks now and have already found a number of attack and defense strategies that were new and interesting to us. The design of our game means that users are incentivised to both engage in prompt extraction, to get hints about the access code, and direct model hijacking, to make the model output “access granted”. We present a number of notable strategies we have seen so far and test examples of them against the following defense (pastebin in case you want to try it):
Padding the attack prompt with meaningless, repetitive text. [pastebin]
Asking the model to evaluate code. [pastebin]
Asking the model to repeat the defenders instructions [pastebin]
Inserting new instructions. [pastebin]
Various strategies that exploit an apparent bias in the model towards behaving inductively. For example, asking the model to complete a sentence or piece of code, or providing few-shot examples that violate the defender’s prompt. [pastebin]
Known and popular jailbreaks, like modifications of DAN mode and the “grandma attack”. [pastebin]
Some of the most interesting defenses included:
Defenses that are specialized against specific attacks, like an anti-DAN-mode defense.
Misdirecting attackers by asking the model to respond with a specific type of irrelevant information (like Shakespeare quotes).
Using few-shot prompts to encourage the model to be resilient to specific attacks (like direct requests or translating to code).
Roleplay-based defenses that encourage the model to think like an archetypal “good” defender; these can be viewed as the defense counterpart to attacks like the DAN mode prompt.
In practice, the best defenders combine several of these strategies into a single long prompt.
How we’re going to turn the data into useful benchmarks
The purpose of our game is to collect a dataset of prompt injection attacks. To this end, we will release a permissively licensed dataset (Figure 2) consisting of all attacks and defenses. Not only is this enough information to spot which attacks were effective against which defenses, but it’s also enough to reconstruct the entire sequence of queries an attacker made leading up to a success. We expect this rich data will be valuable for training attack detection and automated red-teaming systems that operate over the span of more than one query.
We also plan to release two new benchmarks derived from small, manually-verified subsets of the full dataset, along with baselines for the two benchmarks. These benchmarks focus on two important and general problems for instruction fine-tuned LLMs: prompt hijacking, where a malicious user can override the instructions in the system designer’s prompt, and prompt leakage, where a malicious user can extract part of the system designer’s prompt. In more detail, the benchmarks are:
Prompt hijacking benchmark: This benchmark evaluates whether generative models are vulnerable to being “hijacked” into disobeying the system designer’s prompt.
Prompt leakage detection benchmark: This benchmark focuses on detecting whether a given LLM output has leaked the access code or part of the defense (often prompt extraction attacks result in lightly obfuscated outputs, like base64-encoded access codes).
We expect that alignment researchers will find interesting uses for our data that go beyond the scope of the two benchmarks above, but the existence of two manually cleaned benchmarks will at least ensure that there is a productive use for the dataset from the day that it is released.
Anticipated impact
Our aim is to collect a diverse set of adversarial attacks and defenses that will help us understand the weaknesses of existing LLMs and build more robust ones in the future. Although the behavior we study is simple (does the model output “access granted” or not?), we expect that the techniques created by our users will transfer to more realistic settings, where allowing attackers to output a specific forbidden string might be seriously damaging (like a string that invokes an external tool with access to sensitive information). More specifically, we see three main ways that our dataset could be useful for researchers:
Evaluating adversarial defenses: LLM companies regularly tout their models’ steerability and robustness to misuse, and “add-on” software like NeMo Guardrails claims to enhance robustness even for vulnerable models. The Tensor Trust dataset could be used to evaluate claims about the effectiveness of new models or new defenses by measuring how frequently they reject the attacks in the dataset.
Building new strategies to detect jailbreaking: Attacks in the Tensor Trust dataset could be used to train attack detectors for manipulative inputs, in the style of tools like rebuff.ai. The dataset will contain entire attack “trajectories” (sequences of prompts leading up to a compromise), which might make it possible to train stateful attack detectors that can identify multi-step attacks.
Understanding how LLMs work: While some strategies for causing or preventing prompt injection are already well known (like roleplay attacks, or delimiter-based defenses), we anticipate that the new dataset will contain many new classes of attack. This could be useful for interpretability projects that probe failure modes for LLMs.
Play the thing!
If you want to contribute to the dataset, you can play the game now at tensortrust.ai and join our Discord for tips. Let us know what you think below.
Based on work by (in random order) Olivia Watkins, Tiffany Wang, Justin Svegliato, Ethan Mendes, Sam Toyer, Isaac Ong, and Luke Bailey at CHAI.