<3
egor.timatkov
“It’s a 10% chance which I did 10 times, so it should be 100%”
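For reference, here is a quick sketch of what the quoted reasoning gets wrong, assuming the 10 tries are independent. The function name is just for illustration.

```python
def chance_of_at_least_one(p, tries):
    """Probability that an event with per-try chance p happens at least once."""
    # The event fails on every try with probability (1 - p) ** tries.
    return 1 - (1 - p) ** tries

# A 10% chance repeated 10 times is about 65%, not 100%.
print(round(chance_of_at_least_one(0.10, 10), 4))  # 0.6513
```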
My guess is that it probably works, and it’s useful to have, but I think the moment that it’s made public in any way, people will break it pretty easily.
I haven’t, no!
It seems interesting, I’ll check it out
Wow. This is some really interesting stuff. Upvoting your comment.
The a-before-e example is just there to explain, in a human-readable way, how a watermark works. The important bit is that each individual section of the text is unlikely to occur by chance according to some objective scale, be it a-before-e, or hashing mod 10, or some other way.
I really like your example of hashing small bits of the text to 0 mod 10 though. I would have to look into how often you can actually edit text this way without significantly changing the meaning, but as soon as that’s done, you can solve for an N and find how much text you need in order to determine the text is watermarked.
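To make the "solve for an N" step concrete, here is a rough sketch under several assumptions of mine: chunks are fixed runs of 3 words, the hash is SHA-256, an unwatermarked chunk hits 0 mod 10 with probability 0.1, and a normal approximation to the binomial is good enough. The watermarked hit rate `p1` is a placeholder, since how often text can actually be edited this way is exactly the open question.

```python
import hashlib
import math

def chunk_hits(text, chunk_words=3, modulus=10):
    """Count non-overlapping word chunks whose SHA-256 hash is 0 mod `modulus`."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words) - chunk_words + 1, chunk_words)]
    hits = sum(
        int(hashlib.sha256(c.encode("utf-8")).hexdigest(), 16) % modulus == 0
        for c in chunks
    )
    return hits, len(chunks)

def z_score(hits, n, p0=0.1):
    """How many standard deviations the hit count is above the chance level p0."""
    if n == 0:
        return 0.0
    return (hits - n * p0) / math.sqrt(n * p0 * (1 - p0))

def chunks_needed(p1, p0=0.1, z=4.0):
    """Smallest N at which the expected z-score reaches `z`, if watermarked
    chunks hit with rate p1 (normal approximation)."""
    return math.ceil(z ** 2 * p0 * (1 - p0) / (p1 - p0) ** 2)

# Hypothetical: if editing can push the hit rate from 10% to 50%.
print(chunks_needed(0.5))
```

The lower the achievable hit rate, the more text you need: the required N grows with 1 / (p1 - p0)^2.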
Yes, so indexing all generations is absolutely a viable strategy, though like you said, it might be more expensive.
Watermarking by choosing different tokens at a specific temperature might not be as effective (as you touched on), because to reverse that you need the exact input. Even a slight change to the input or the context will shift the probability distribution over the tokens, which means you can’t tell whether the LLM chose the first, second, or third most probable token just by looking at the token.
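A toy illustration of that point, with made-up logits standing in for a real model: the same token can be the second most probable choice under one context and the most probable under a slightly edited one, so its rank can’t be recovered from the token alone.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution at the given temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def rank_of(token, vocab, logits):
    """1-based rank of `token` when the vocabulary is sorted by logit."""
    order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return order.index(vocab.index(token)) + 1

vocab = ["cat", "dog", "bird"]
logits_original = [2.0, 1.9, 0.5]  # hypothetical logits for the original context
logits_edited = [1.9, 2.0, 0.5]    # hypothetical logits after a small context edit

print(rank_of("dog", vocab, logits_original))  # 2
print(rank_of("dog", vocab, logits_edited))    # 1
```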
That being said, something like this can still be used to watermark text: if there is some objective, text-independent criterion for being watermarked (like the “e” before “a” example, or perhaps something more elaborate created using gradient descent), then you can use the LLM’s loss function to choose some middle ground between maximizing your independent criterion and minimizing the loss.
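A minimal sketch of that middle ground, assuming you can get candidate next tokens with their log-probabilities from the model. The candidates, the `weight` knob, and the stand-in criterion are all hypothetical.

```python
import math

def pick_token(candidates, criterion, weight=1.0):
    """Pick the token maximizing log-probability plus weight * criterion score.

    candidates: dict mapping token -> log-probability under the LLM.
    criterion:  function token -> watermark score (higher is more watermarked).
    weight:     0 ignores the watermark; larger values favor it over fluency.
    """
    return max(candidates, key=lambda t: candidates[t] + weight * criterion(t))

# Hypothetical candidates and a stand-in criterion (token starts with "b").
candidates = {"big": math.log(0.4), "large": math.log(0.5), "zebra": math.log(0.001)}
starts_with_b = lambda token: 1.0 if token.startswith("b") else 0.0

print(pick_token(candidates, starts_with_b, weight=0.0))  # large
print(pick_token(candidates, starts_with_b, weight=1.0))  # big
```

A very unlikely token stays unpicked even if it satisfies the criterion, which is what keeps the text readable.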
The ideal watermark would put markings into the meaning behind the text, not just the words themselves. I have no idea how that would happen, but that way you could watermark an idea, and at that point hacks like “translate to French and back” wouldn’t work. As far as I know, though, watermarking the meaning behind text is currently science fiction.
I haven’t, no. I really wish I could somehow investigate all 3 pillars of a good watermark (Decisiveness, Invisibility, Robustness), but I couldn’t think of any way to quantify a general text watermark’s invisibility. For any given watermark you can technically rate “how invisible it is” by using an LLM’s loss function to see how different the watermarked text is from the original text, but I can’t come up with a way to generalize this.
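For a single, specific watermark, that per-watermark rating could look something like the sketch below. It assumes you already have per-token losses (negative log-likelihoods) from some LLM for both versions of the text; the numbers are placeholders, and this does nothing to solve the generalization problem.

```python
def mean(values):
    return sum(values) / len(values)

def invisibility_gap(nll_original, nll_watermarked):
    """Average per-token loss of the watermarked text minus that of the original.

    A gap near 0 suggests the watermark is hard to notice; a large positive
    gap suggests the watermarked text looks unnatural to the model.
    """
    return mean(nll_watermarked) - mean(nll_original)

# Placeholder per-token losses, not real model output.
print(invisibility_gap([1.0, 2.0], [2.0, 3.0]))  # 1.0
```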
So unfortunately my analysis was only about the interplay between decisiveness and robustness.
Haha, I didn’t think of that. Funny.