[Question] How tokenization influences prompting?

Boris Kashirin29 Jul 2024 10:28 UTC

9 points

I was thinking about how prompt differs from training data in terms of tokenization. If i am to prompt with “solution:” as opposed to “solution: ” it seems like it can influence the result, as in training data last token contain some information about next token. If there is token “: T” but my prompt ended in “: ” it can be inferred than next token can’t be “T[something]”.

Is this real effect, or I just misunderstand how tokenization works?

Boris Kashirin29 Jul 2024 10:28 UTC

9 points

4 comments1 min readLW link

Language Models AI

Brendan Long 29 Jul 2024 19:52 UTC
7 points
0
This is a real effect, and this article gives an example with URL’s: https://towardsdatascience.com/the-art-of-prompt-design-prompt-boundaries-and-token-healing-3b2448b0be38

“:” and “://” are different tokens in this LLM, so prompting with a URL starting with “http:” gives bad results because it can’t use the “://” token.

Although this can be improved with a technique called “token healing” that essentially steps backwards in the prompt and then allows any next token that starts with the same characters in the prompt (i.e. in the “http:” example, it steps backwards to “http” and allows any continuation that starts with “:” in its first token).

Note that this only applies at the level of tokens, so in your example it’s true that the next token can’t be “: T”, but with standard tokenizers, you’ll also get a token for every substring of your longer tokens, so it could be just “T”. Whether this makes things better or worse depends on which usage was more common/better in the training data.
- gwern 30 Jul 2024 1:49 UTC
  10 points
  5
  Parent
  The number of problems that non-character/byte tokenization causes, whether BPE or WordPiece, never fails to amaze me. What a kettle of worms is that attractive-looking hack to save context window & speed up learning—especially as the models become so smart they otherwise make few errors & it becomes harder to shrug away tokenization pathologies.
  - Boris Kashirin 30 Jul 2024 10:39 UTC
    1 point
    0
    Parent
    Would be funny if hurdle presented by tokenization is somehow responsible for LLM being smarter than expected :) Sounds exactly like kind of curveball reality likes to throw at us from time to time :)
    - gwern 30 Jul 2024 13:58 UTC
      10 points
      4
      Parent
      I definitely think that LLMs are ‘smarter than expected’ for many people due to tokenization, if only because they look at tokenization errors, which are so vivid and clear, and then ignore things like GPQA which are arcane and hard to read, and conclude LLMs are stupid. “It can’t even count the letters in ‘strawberry’, obviously this is all bunk.”

No comments.