Although watermarking the meaning behind text is currently, as far as I know, science fiction.
Choosing a random / pseudorandom vector in latent space and then perturbing along that vector works to watermark images, so maybe a related approach would work for text? Key figure from the linked paper:
You can see that the watermark appears to be encoded in the “texture” of the image, but that texture doesn’t look like the texture of anything in particular. A random direction in latent space usually just looks like a texture, and unless you know which texture you’re looking for, knowing that the watermark is “one specific texture is amplified” doesn’t really help you identify which images are watermarked.
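To make the "perturb along a secret direction" idea concrete, here is a minimal sketch of my own, not the linked paper's method. It assumes you already have some latent vector for the image (the encoder and decoder are out of scope), and the names `key_direction`, `embed`, and `score` are hypothetical helpers I made up for illustration.

```python
import hashlib
import math
import random

def key_direction(key, dim):
    """Derive a pseudorandom unit vector in latent space from a secret key."""
    # Hash the key to a seed so the same key always yields the same direction.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def embed(latent, key, amplitude):
    """Watermark a latent by nudging it along the key's direction."""
    d = key_direction(key, len(latent))
    return [x + amplitude * u for x, u in zip(latent, d)]

def score(latent, key):
    """Projection of the latent onto the key's direction.

    Embedding with the same key raises this by exactly `amplitude`.
    """
    d = key_direction(key, len(latent))
    return sum(x * u for x, u in zip(latent, d))
```

Someone holding the key can threshold `score`; someone without it has to guess among directions that, in high dimensions, are nearly orthogonal to the real one, so their projection barely moves. How large `amplitude` must be to survive a real encode/decode round trip is not something this sketch tells you.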
There are ways to get features of images that are higher-level than textures; one example that sticks in my mind is [Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness](https://arxiv.org/pdf/2408.05446).
I would expect that a random direction in the latent space of the adversarially robust image classifier looks less texture-like but is still perceptible by a dedicated classifier at far lower amplitude than the point where it becomes perceptible to a human.
As you note, things like “how many words have the first e before the first a” are texture-like features of text, and a watermarking scheme that used such a feature would work to at least some degree. However, I bet you can do a lot better than such manually-constructed features, and I would not be surprised if you could get watermarking / steganography using higher-level features working pretty well for language models.
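For reference, the manually-constructed feature is easy to compute. This is a rough sketch under my own reading of it: only words containing both letters count, and matching is case-insensitive on ASCII letters. The function name `e_before_a_counts` is made up for illustration.

```python
import re

def e_before_a_counts(text):
    """Count words where the first 'e' precedes the first 'a', and vice versa.

    Words missing either letter are ignored (an assumption of this sketch).
    Returns a tuple (e_first, a_first).
    """
    e_first = 0
    a_first = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        if "e" in word and "a" in word:
            if word.index("e") < word.index("a"):
                e_first += 1
            else:
                a_first += 1
    return e_first, a_first
```

A detector would compare the ratio of the two counts against whatever is typical for unwatermarked text, and a generator would have to bias its word choices to shift that ratio. That is the sense in which it is texture-like: it says nothing about what the text means.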
That said, I am not eager to spend unpaid effort to bring this technology into the world, since I expect most of the uses for a tool that allows a company to see if text was generated by their LLM would look more like “detecting noncompliance with licensing terms” and less like “detecting rogue behavior by agents”. And if the AI companies want such a technology to exist, they have money and can pay for someone to build it.
Edit: resized images to not be giant
Wow. This is some really interesting stuff. Upvoting your comment.