With the vocabulary V fixed, we now have a canonical way of taking any string S of real text and mapping it to a finite sequence of elements of V.
Correct me if I’m wrong, but: you don’t actually describe any map Text→V∗? The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step.
The simplified story can be found at the end of the “Implementing BPE” part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in that same order during tokenization. The GPT-2 implementation seems very similar, but it has a few parts I don’t understand completely, e.g. what does that regex do?
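The procedure described here (save the merge rules in training order, then replay them at tokenization time) might be sketched as follows; this is a simplified illustration under my own naming, not the actual Hugging Face or GPT-2 code, and the toy merge list is hypothetical:

```python
def bpe_encode(text, merges):
    """Apply BPE merge rules to a string.

    merges: list of (left, right) token pairs, in the order
    they were learned during training.
    Returns the resulting token sequence.
    """
    # Start from individual characters (real implementations
    # typically start from bytes and pre-split with a regex).
    tokens = list(text)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge every occurrence of the pair (left, right).
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Toy example: two learned merges, applied in order.
merges = [("a", "b"), ("ab", "c")]
print(bpe_encode("abcab", merges))  # ['abc', 'ab']
```

Because later merges can only act on tokens produced by earlier ones, replaying the rules in their original order is what makes the encode map well defined.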
Ah, thanks very much, Daniel. Yes, now that you mention it, I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn’t worth expanding on. But yeah, I guess you don’t get a well-defined map until you actually fix how the tokenization happens with another, separate algorithm. I will add this to the list of things to fix/expand on in an edit.