Yes, indexing all generations is absolutely a viable strategy, though like you said, it might be more expensive.
Watermarking by choosing different tokens at a specific temperature might not be as effective (as you touched on), because in order to reverse that, you need the exact input. Even a slight change to the input or the context will shift the probability distribution over the tokens, which means you can’t know whether the LLM chose the first, second, or third most probable token just by looking at the token.
That being said, something like this can still be used to watermark text: if there is some objective, text-independent criterion for being watermarked (like the “e” before “a” example, or perhaps something more elaborate created using gradient descent), then you can combine it with the LLM’s loss function and choose some middle ground between maximizing your independent criterion and minimizing the loss.
The ideal watermark would put markings into the meaning behind the text, not just the words themselves. No idea how that would happen, but in that way you could watermark an idea, and at that point hacks like “translate to French and back” won’t work. Although watermarking the meaning behind text is currently, as far as I know, science fiction.
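To make the "middle ground" idea concrete, here is a minimal sketch in Python. It is a toy, not a real decoding procedure: the candidate texts and their log-probabilities are made up, `strength` is a hypothetical knob, and the exact reading of the “e” before “a” criterion (a word counts if its first “e” comes before its first “a”, or if it has an “e” and no “a”) is an assumption.

```python
def e_before_a(word):
    """Assumed criterion: the first 'e' precedes the first 'a', or there is an 'e' and no 'a'."""
    w = word.lower()
    e, a = w.find("e"), w.find("a")
    return e != -1 and (a == -1 or e < a)

def watermark_score(text):
    """Fraction of words meeting the criterion; depends only on the text, not on the prompt."""
    words = text.split()
    if not words:
        return 0.0
    return sum(e_before_a(w) for w in words) / len(words)

def pick_candidate(candidates, strength):
    """candidates: list of (text, log_prob) pairs.

    Returns the text that maximizes log_prob + strength * watermark_score,
    i.e. a trade-off between low loss and a strong watermark signal.
    """
    return max(candidates, key=lambda c: c[1] + strength * watermark_score(c[0]))[0]

# Hypothetical candidates with made-up log-probabilities.
candidates = [
    ("the cat sat on a mat", -4.0),
    ("the feline rested on the rug", -4.5),
]

print(pick_candidate(candidates, strength=0.0))  # pure likelihood
print(pick_candidate(candidates, strength=2.0))  # likelihood plus watermark pressure
```

With `strength=0.0` the more likely candidate wins. Raising `strength` lets a slightly less likely candidate win if it carries more of the signal, and checking for the signal later only requires the text.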
Although watermarking the meaning behind text is currently, as far as I know, science fiction.
Choosing a random / pseudorandom vector in latent space and then perturbing along that vector works to watermark images; maybe a related approach would work for text? Key figure from the linked paper:
You can see that the watermark appears to be encoded in the “texture” of the image, but that texture doesn’t look like the texture of anything in particular. Rather, a random direction in latent space usually just looks like a texture. Unless you know which texture you’re looking for, knowing that the watermark is “one specific texture is amplified” doesn’t really help you identify which images are watermarked.
There are ways to get features of images that are higher-level than textures. One example that sticks in my mind is [Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness](https://arxiv.org/pdf/2408.05446).
I would expect that a random direction in the latent space of the adversarially robust image classifier looks less texture-like but is still perceptible by a dedicated classifier at far lower amplitude than the point where it becomes perceptible to a human.
As you note, things like “how many words have the first e before the first a” are texture-like features of text, and a watermarking scheme that used such a feature would work, at least to some degree. However, I bet you can do a lot better than such manually-constructed features, and I would not be surprised if you could get watermarking / steganography using higher-level features working pretty well for language models.
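As a rough illustration of the "perturb along a keyed random direction" idea (not the linked paper's actual method), here is a small Python sketch. The latent is just a list of floats standing in for whatever a real encoder would produce, and `keyed_direction`, `embed`, `detect_score`, the key values, and the amplitude are all hypothetical.

```python
import math
import random

def keyed_direction(key, dim):
    """Pseudorandom unit vector derived from a secret integer key."""
    rng = random.Random(key)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def embed(latent, key, amplitude):
    """Shift the latent along the keyed direction by `amplitude`."""
    d = keyed_direction(key, len(latent))
    return [z + amplitude * x for z, x in zip(latent, d)]

def detect_score(latent, key):
    """Projection of the latent onto the keyed direction."""
    d = keyed_direction(key, len(latent))
    return sum(z * x for z, x in zip(latent, d))

# Stand-in latent: 1024 standard-normal coordinates (an assumption, not a real encoder output).
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(1024)]
marked = embed(latent, key=1234, amplitude=6.0)

shift_right_key = detect_score(marked, 1234) - detect_score(latent, 1234)
shift_wrong_key = detect_score(marked, 9999) - detect_score(latent, 9999)
print(shift_right_key, shift_wrong_key)
```

The projection moves by the full amplitude when you test with the right key and barely moves with a wrong one, which matches the point above: knowing that some direction was amplified does not help unless you know which one.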
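For what the detection side of such a hand-built feature might look like, here is a hedged Python sketch. It counts words whose first “e” comes before their first “a” and compares the count to a baseline rate with a simple binomial z-score. The baseline rate of 0.3 is a made-up placeholder that would have to be measured on unwatermarked text, and treating words with an “e” but no “a” as hits is an assumption.

```python
import math

def first_e_before_first_a(word):
    """Assumed feature: first 'e' precedes first 'a', or the word has an 'e' and no 'a'."""
    w = word.lower()
    e, a = w.find("e"), w.find("a")
    return e != -1 and (a == -1 or e < a)

def z_score(text, baseline_rate):
    """How many standard deviations the hit count sits above the baseline expectation."""
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    words = text.split()
    n = len(words)
    if n == 0:
        return 0.0
    hits = sum(first_e_before_first_a(w) for w in words)
    expected = n * baseline_rate
    std = math.sqrt(n * baseline_rate * (1.0 - baseline_rate))
    return (hits - expected) / std

# 0.3 is a hypothetical baseline, not a measured statistic of English.
print(z_score("the feline rested on the rug", baseline_rate=0.3))
print(z_score("cat sat on a mat", baseline_rate=0.3))
```

On short texts the score is noisy, so in practice you would need long passages or a stronger feature before a threshold on this number means much.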
That said, I am not eager to spend unpaid effort to bring this technology into the world, since I expect most of the uses for a tool that allows a company to see if text was generated by their LLM would look more like “detecting noncompliance with licensing terms” and less like “detecting rogue behavior by agents”. And if the AI companies want such a technology to exist, they have money and can pay for someone to build it.
Edit: resized images to not be giant
Wow. This is some really interesting stuff. Upvoting your comment.