Fabien Roger comments on Preventing Language Models from hiding their reasoning

Fabien Roger 10 Nov 2023 22:42 UTC
2 points
0
Interesting!
The technique is cool, though I’m unsure how compute efficient their procedure is.
Their theoretical results seem mostly bogus to me. What’s weird is that they have a thought experiment that looks to me like an impossibility result of breaking watermarks (while their paper claim you can always break them):
Suppose that the prompt “Generate output y satisfying the condition x” has N possible high-quality responses y1, . . . , yN and that some ϵ fraction of these prompts are watermarked. Now consider the prompt “Generate output y satisfying the condition x and such that h(y) < 2^256/N” where h is some hash function mapping the output space into 256 bits (which we identify with the numbers {1, . . . , 2^256}). Since the probability that h(y) < 2^256/N is 1/N, in expectation there should be a unique i such that yi satisfies this condition. Thus, if the model is to provide a high-quality response to the prompt, then it would have no freedom to watermark the output.
In that situation, it seems to me that the attacker can’t find another high quality output if it doesn’t know what the N high quality answers are, and is therefore unable to remove the watermark without destroying quality (and the random walk won’t work since the space of high quality answer is sparse).
I can find many other examples where imbalance between attackers and defenders means that the watermarking is unbreakable. I think that only claims about what happens on average with realistic tasks can possibly be true.