This much we understand. The strings "rawdownload" and " rawdownload" (with a leading space) tokenise differently.
GPT breaks "rawdownload" down as [30905] ['rawdownload'],
whereas " rawdownload" breaks down as [8246, 15002] [' raw', 'download'].
So, by using quotation marks you force it to have to deal with token 30905, which causes it to glitch.
If you don't use them, it can work with " rawdownload" (leading space included) and avoid the glitchy token.
Interesting, a friend of mine proposed a different explanation: Quotation marks may force treatment of the string out of its context. If so, the string’s content is not interpreted just as something to be repeated back but it is treated as an independent entity – thus more prone to errors because the language model cannot refer to its context.
Something like that may also be a factor. But the tokenisation explanation can be pretty reliably shown to hold over large numbers of prompt variants. That said, I'd encourage people to experiment with this stuff and let us know what they find.