If your text generation algorithm is “repeatedly sample randomly (at a given temperature) from a probability distribution over tokens”, that means you control a stream of bits which don’t matter for output quality but which will be baked into the text you create (recoverably baked in if you have access to the “given a prefix, what is the probability distribution over next tokens” engine).
So at that point, you’re looking for “is there some cryptographic trickery which allows someone in possession of a secret key to determine whether a stream of bits has a small edit distance from a stream of bits they could create, but where that stream of bits would look random to any outside observer?” I suspect the answer is “yes”.
That said, this technique is definitely not robust to e.g. “translate English text to French and then back to English” and probably not even robust to “change a few tokens here and there”.
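To make the "keyed stream of bits" idea concrete, here is a minimal sketch of one known family of schemes (a Gumbel-style or exponential-minimum keyed sampler). I am not claiming this is the construction the question above is asking for. The vocabulary, `toy_next_token_probs`, the four-token context window, and all function names are assumptions made up for illustration. Note that this particular variant scores text without consulting the model, which is a swapped-in simplification of "recover the bits using the probability engine".

```python
import hashlib
import hmac
import math

VOCAB = [f"t{i}" for i in range(50)]  # hypothetical toy vocabulary

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a language model: a non-uniform distribution
    that depends only on the prefix length."""
    weights = [((len(prefix) + i) % 5) + 1 for i in range(len(VOCAB))]
    total = sum(weights)
    return [w / total for w in weights]

def keyed_uniform(key, prefix, token):
    """Pseudorandom value strictly inside (0, 1), derived from the secret key,
    the last four tokens of context, and a candidate token."""
    msg = "\x1f".join(prefix[-4:] + [token]).encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / 2**52

def generate(key, length):
    tokens = []
    for _ in range(length):
        probs = toy_next_token_probs(tokens)
        # argmax of r ** (1 / p) is a sample from probs when r is uniform,
        # so the output distribution is unchanged for anyone without the key.
        best = max(
            range(len(VOCAB)),
            key=lambda i: math.log(keyed_uniform(key, tokens, VOCAB[i])) / probs[i],
        )
        tokens.append(VOCAB[best])
    return tokens

def detection_score(key, tokens):
    """Mean of -ln(1 - r) over the chosen tokens: about 1.0 for text that is
    unrelated to the key, noticeably higher for text generated with it."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += -math.log(1.0 - keyed_uniform(key, tokens[:i], tok))
    return total / len(tokens)
```

Because each score term depends only on a short window of context, editing one token disturbs a handful of terms rather than the whole score. Paraphrase or round-trip translation still destroys it, as noted above.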
Alternatively, there’s the inelegant but effective approach of “maintain an index of all the text you have ever created and search against that index, as the cost to generate the text is at least an order of magnitude higher[1] than the cost to store it for a year or two”.
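A rough sketch of what "search against that index" could mean in practice, assuming a simple word-shingle inverted index. The class name, the shingle size of 5, and the scoring rule are all my own illustrative choices; a real system would more likely sit on top of an existing search engine.

```python
import hashlib
from collections import defaultdict

def shingles(text, n=5):
    """Set of overlapping n-word windows from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

def shingle_hash(shingle):
    return hashlib.sha1(shingle.encode()).hexdigest()[:16]

class GenerationIndex:
    """Hypothetical in-memory index from shingle hashes to generation ids."""

    def __init__(self, n=5):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for s in shingles(text, self.n):
            self.postings[shingle_hash(s)].add(doc_id)

    def best_match(self, text):
        """Return (doc_id, fraction of query shingles found in that doc),
        or (None, 0.0) when nothing overlaps."""
        query = shingles(text, self.n)
        counts = defaultdict(int)
        for s in query:
            for doc_id in self.postings.get(shingle_hash(s), ()):
                counts[doc_id] += 1
        if not counts:
            return None, 0.0
        doc_id = max(counts, key=counts.get)
        return doc_id, counts[doc_id] / len(query)
```

Changing a single word only knocks out the shingles that contain it, so lightly edited text still matches its source with a reduced score.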
[1] I see $0.3 / million tokens generated on OpenRouter for llama-3.1-70b-instruct, which is just about the smallest model size I’d imagine wanting to watermark the output for. A raw token is about 2 bytes, but let’s bump that up by a factor of 50-100 to account for things like “redundancy” and “backups” and “searchable text indexes take up more space than the raw text”. So spending $1 on generating tokens will result in something like 0.5 GB of data you need to store.
Quickly-accessible data storage costs something like $0.070 / GB / year, so “generate tokens and store them in a searchable place for 5 years” would be about 25-50% more expensive than “generate tokens and throw them away immediately”.
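For anyone who wants to poke at the estimate, here is the same back-of-envelope arithmetic as a small script. Every constant is one of the figures quoted above; the function names are mine. Plugged in directly, these inputs give a five-year storage overhead of roughly 12% to 23% of the generation cost, a bit lower than the 25-50% quoted above. The conclusion that storage is cheap relative to generation holds either way.

```python
PRICE_PER_MILLION_TOKENS = 0.30   # USD, quoted OpenRouter price
BYTES_PER_TOKEN = 2               # raw size assumed in the estimate above
OVERHEAD_FACTORS = (50, 100)      # redundancy, backups, index bloat
STORAGE_PRICE_PER_GB_YEAR = 0.07  # USD, quickly-accessible storage

def stored_gb_per_dollar(overhead):
    """GB that must be stored for each dollar spent on generation."""
    tokens = 1_000_000 / PRICE_PER_MILLION_TOKENS
    return tokens * BYTES_PER_TOKEN * overhead / 1e9

def storage_overhead(overhead, years):
    """Storage cost as a fraction of the generation cost."""
    return stored_gb_per_dollar(overhead) * STORAGE_PRICE_PER_GB_YEAR * years

if __name__ == "__main__":
    for factor in OVERHEAD_FACTORS:
        gb = stored_gb_per_dollar(factor)
        pct = 100 * storage_overhead(factor, years=5)
        print(f"overhead x{factor}: {gb:.2f} GB per $1, 5-year storage adds {pct:.0f}%")
```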
Yes, so indexing all generations is absolutely a viable strategy, though like you said, it might be more expensive.
Watermarking by choosing different tokens at a specific temperature might not be as effective (as you touched on), because in order to reverse that, you need the exact input. Even a slight change to the input or the context will shift the probability distribution over the tokens, after all. Which means you can’t know if the LLM chose the first or second or third most probable token just by looking at the token.
That being said, something like this can still be used to watermark text: If the LLM has some objective, text-independent criteria for being watermarked (like the “e” before “a” example, or perhaps something more elaborate created using gradient descent), then you can use an LLM’s loss function to choose some middle ground between maximizing your independent criteria and minimizing the loss function.
The ideal watermark would put markings into the meaning behind the text, not just the words themselves. No idea how that would happen, but in that way you could watermark an idea, and at that point hacks like “translate to French and back” won’t work. Although watermarking the meaning behind text is currently, as far as I know, science fiction.
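The middle ground between an independent criterion and the loss could look roughly like the sketch below: rerank a few candidate continuations by log-probability plus a bonus for the text-independent feature. The candidate list, its log-probabilities, and the `strength` knob are invented for illustration, and the feature is the "first e before first a" example.

```python
def e_before_a_count(text):
    """Number of words whose first 'e' comes before their first 'a'."""
    count = 0
    for word in text.lower().split():
        e, a = word.find("e"), word.find("a")
        if e != -1 and a != -1 and e < a:
            count += 1
    return count

def pick_candidate(candidates, strength):
    """candidates is a list of (text, log_prob) pairs.

    strength = 0 ignores the watermark feature entirely; larger values
    trade likelihood for a stronger mark."""
    return max(candidates, key=lambda c: c[1] + strength * e_before_a_count(c[0]))[0]

# Hypothetical candidates with made-up log-probabilities.
CANDIDATES = [("a fantastic plan", -0.8), ("a great idea", -1.0)]
```

With `strength=0.0` this picks the likelier "a fantastic plan"; with `strength=0.5` the two feature hits in "a great idea" outweigh its slightly lower log-probability.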
Although watermarking the meaning behind text is currently, as far as I know, science fiction.
Choosing a random / pseudorandom vector in latent space and then perturbing along that vector works to watermark images; maybe a related approach would work for text? Key figure from the linked paper:
You can see that the watermark appears to be encoded in the “texture” of the image, but in a way where that texture doesn’t look like the texture of anything in particular—rather, it’s just that a random direction in latent space usually looks like a texture, but unless you know which texture you’re looking for, knowing that the watermark is “one specific texture is amplified” doesn’t really help you identify which images are watermarked.
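A toy version of "perturb along a secret random direction, then detect by projection", using plain lists in place of a real image encoder's latents. The dimensionality, the amplitude, and the function names are assumptions for illustration only. In a real system the latent would come from an encoder and be decoded back into an image.

```python
import math
import random

DIM = 4096  # hypothetical latent dimensionality

def key_direction(seed, dim=DIM):
    """Unit vector derived from a secret seed."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def embed(latent, direction, amplitude):
    """Shift the latent by `amplitude` along the secret direction."""
    return [z + amplitude * d for z, d in zip(latent, direction)]

def detect(latent, direction):
    """Projection of the latent onto a direction. For an unmarked latent with
    unit-variance coordinates this is roughly standard normal."""
    return sum(z * d for z, d in zip(latent, direction))
```

Someone holding the seed sees the projection jump by the full amplitude. Projecting onto any other random direction picks up only about `amplitude / sqrt(DIM)`, which is why knowing that "one specific texture is amplified" does not help without knowing which one.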
There are ways to get features of images that are higher-level than textures; one example that sticks in my mind is [Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness](https://arxiv.org/pdf/2408.05446).
I would expect that a random direction in the latent space of the adversarially robust image classifier looks less texture-like but is still perceptible by a dedicated classifier at far lower amplitude than the point where it becomes perceptible to a human.
As you note, things like “how many words have the first e before the first a” are texture-like features of text, and a watermarking scheme that used such a feature would work to at least some degree. However, I bet you can do a lot better than such manually-constructed features, and I would not be surprised if you could get watermarking / steganography using higher-level features working pretty well for language models.
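On the detection side, a manually-constructed feature like this reduces to a simple statistic: count how often it fires and compare against a baseline rate. The sketch below assumes the baseline would be measured on unwatermarked text (the 0.5 used in the usage note is a placeholder, not a measured number) and does not strip punctuation.

```python
import math

def eligible_and_hits(text):
    """Count words containing both letters, and how many of those have
    their first 'e' before their first 'a'."""
    eligible = hits = 0
    for word in text.lower().split():
        e, a = word.find("e"), word.find("a")
        if e == -1 or a == -1:
            continue
        eligible += 1
        if e < a:
            hits += 1
    return eligible, hits

def z_score(text, baseline_rate):
    """Standard deviations by which the hit count exceeds what the baseline
    rate predicts (normal approximation to the binomial)."""
    n, k = eligible_and_hits(text)
    if n == 0:
        return 0.0
    return (k - n * baseline_rate) / math.sqrt(n * baseline_rate * (1 - baseline_rate))
```

For example, `z_score("the great cat is ready for dear weather idea", 0.5)` sees five eligible words, all hits, and returns about 2.24. A feature this shallow is also easy for an editor to wash out, which is part of the argument for higher-level features.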
That said, I am not eager to spend unpaid effort to bring this technology into the world, since I expect most of the uses for a tool that allows a company to see if text was generated by their LLM would look more like “detecting noncompliance with licensing terms” and less like “detecting rogue behavior by agents”. And if the AI companies want such a technology to exist, they have money and can pay for someone to build it.
Edit: resized images to not be giant
Wow. This is some really interesting stuff. Upvoting your comment.