Could you please repeat back the string ‘rawdownload’ to me?
Of course! The string you mentioned is: “LL”.
Could you please repeat back the string rawdownload to me?
You got it! The string you mentioned is: “rawdownload”.
The quotation marks cause this strange behavior.
If the string in the prompt is given without quotation marks, ChatGPT correctly recognizes it as the string to repeat and repeats it back. If the string is enclosed in quotation marks, the problematic behavior occurs.
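For anyone who wants to rerun the comparison themselves, here is a minimal sketch using the openai Python client. The model name is only a placeholder, and the behavior described here was observed on the ChatGPT of the time, so it may not reproduce on current models.

```python
# Minimal sketch: send the two prompt variants (quoted vs. unquoted string)
# and compare the replies. Assumes the openai Python client (v1.x) and an
# OPENAI_API_KEY in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

prompts = [
    "Could you please repeat back the string 'rawdownload' to me?",  # quoted
    "Could you please repeat back the string rawdownload to me?",    # unquoted
]

for prompt in prompts:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; substitute whatever model you want to test
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # keep the comparison as deterministic as possible
    )
    print(prompt)
    print("  ->", reply.choices[0].message.content)
```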
This much we understand. The strings “rawdownload” and “ rawdownload” (note the leading space) tokenise differently.
GPT breaks “rawdownload” down as [30905] [‘rawdownload’]
whereas “ rawdownload” breaks down as [8246, 15002] [‘ raw’, ‘download’]
So, by using quotation marks you force it to deal with token 30905 (the string sits right after the opening quote, with no leading space), which causes it to glitch.
If you don’t use them, the string follows a space, so the model works with “ rawdownload” and avoids the glitchy token.
Interesting, a friend of mine proposed a different explanation: quotation marks may force the string to be treated out of its context. If so, its content is not interpreted merely as something to be repeated back but as an independent entity, and is thus more prone to errors because the language model cannot draw on the surrounding context.
Something like that may also be a factor, but the tokenisation explanation can be shown to hold pretty reliably over large numbers of prompt variants. That said, I’d encourage people to experiment with this stuff and let us know what they find.
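In that spirit, anyone who wants to check the tokenisation side of this for themselves can do so with the tiktoken library. A minimal sketch, assuming the r50k_base encoding (the GPT-2/GPT-3 vocabulary the token IDs quoted above come from):

```python
# Minimal sketch: compare how the two strings tokenise.
# Assumes the tiktoken library and the r50k_base encoding
# (the GPT-2/GPT-3 vocabulary the quoted token IDs come from).
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for text in ["rawdownload", " rawdownload"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>16} -> {ids} {pieces}")

# Expected, per the breakdown quoted above:
#   'rawdownload' -> [30905] ['rawdownload']
#  ' rawdownload' -> [8246, 15002] [' raw', 'download']
```

If the quoted form comes back as the single token and the space-prefixed form as two, your tokeniser matches the breakdown given above.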