This much we understand. The strings "rawdownload" and " rawdownload" (with a leading space) tokenise differently.
GPT breaks "rawdownload" down as [30905] ['rawdownload'],
whereas " rawdownload" breaks down as [8246, 15002] [' raw', 'download'].
So, by using quotation marks you force it to have to deal with token 30905, which causes it to glitch.
If you don't use them, it can work with " rawdownload" (leading space included) and avoid the glitchy token.
Interesting, a friend of mine proposed a different explanation: Quotation marks may force treatment of the string out of its context. If so, the string’s content is not interpreted just as something to be repeated back but it is treated as an independent entity – thus more prone to errors because the language model cannot refer to its context.
Something like that may also be a factor. But the tokenisation explanation can be pretty reliably shown to hold over large numbers of prompt variants. That said, I'd encourage people to experiment with this stuff and let us know what they find.