I think that your a-before-e example is confusing your intuition: a typical watermark that occurs 10% of the time isn’t going to be semantic; it’s more like “this n-gram hashed with my nonce == 0 mod 10”.
The a-before-e example is just there to explain, in a human-readable way, how a watermark works. The important bit is that each individual section of the text is unlikely to occur according to some objective measure, be it a-before-e, hashing mod 10, or something else.
I really like your example of hashing small bits of the text to 0 mod 10, though. I would have to look into how often you can actually edit text to hit that target without significantly changing the meaning, but once that rate is known, you can solve for N, the amount of text you need in order to determine that it is watermarked.
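To make the hashing idea concrete, here is a rough Python sketch, not a real watermarking scheme. Everything in it is an assumption for illustration: the function names (`ngram_hits`, `z_score`, `ngrams_needed`), the choice of SHA-256, word-level 3-grams, and especially `p1`, the hit rate a watermarker could reach by editing, which is exactly the number I said I would have to look into. The N calculation uses the standard normal approximation for telling a binomial rate `p1` apart from a baseline `p0`, and it treats n-grams as independent, which overlapping n-grams are not.

```python
import hashlib
import math


def ngram_hits(text: str, nonce: bytes, n: int = 3, modulus: int = 10) -> tuple[int, int]:
    """Count word n-grams whose keyed hash is 0 mod `modulus`.

    Returns (hits, total). For unwatermarked text, roughly 1/modulus of
    the n-grams should be hits.
    """
    words = text.split()
    hits = 0
    total = 0
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n]).encode("utf-8")
        digest = hashlib.sha256(nonce + gram).digest()
        value = int.from_bytes(digest[:8], "big")  # first 8 bytes as an integer
        if value % modulus == 0:
            hits += 1
        total += 1
    return hits, total


def z_score(hits: int, total: int, p0: float = 0.1) -> float:
    """Standard deviations by which `hits` exceeds the baseline rate p0."""
    if total == 0:
        return 0.0
    return (hits - p0 * total) / math.sqrt(total * p0 * (1 - p0))


def ngrams_needed(p0: float, p1: float, z_alpha: float = 3.0, z_beta: float = 1.0) -> int:
    """Approximate number of (independent) n-grams needed to tell rate p1 from p0.

    z_alpha sets how strict the false-positive threshold is; z_beta sets how
    often genuinely watermarked text should clear it. Both defaults are
    arbitrary illustrative choices.
    """
    numerator = z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(p1 * (1 - p1))
    return math.ceil((numerator / (p1 - p0)) ** 2)


if __name__ == "__main__":
    hits, total = ngram_hits("the quick brown fox jumps over the lazy dog", b"my-nonce")
    print(hits, total, z_score(hits, total))
    # Hypothetical: editing pushes the hit rate from 10% to 30%.
    print(ngrams_needed(0.1, 0.3))
```

The closer the achievable rate `p1` is to the 10% baseline, the faster N grows, so how freely you can edit without changing the meaning is what really determines how much text you need.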