With the vocabulary V fixed, we now have a canonical way of taking any string S of real text and mapping it to a finite sequence of elements of V.
Correct me if I’m wrong, but: you don’t actually describe any map Text→V∗? The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step.
The simplified story can be found at the end of the “Implementing BPE” part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in that same order during tokenization. The GPT-2 implementation seems very similar, but it has a few parts I don’t understand completely, e.g. what does that regex do?
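The procedure described here (save the merge rules in training order, then replay them at tokenization time) might be sketched as follows; this is a simplified illustration under my own naming, not the actual Hugging Face or GPT-2 code, and the toy merge list is hypothetical:

```python
def bpe_encode(text, merges):
    """Apply BPE merge rules to a string.

    merges: list of (left, right) token pairs, in the order
    they were learned during training.
    Returns the resulting token sequence.
    """
    # Start from individual characters (real implementations
    # typically start from bytes and pre-split with a regex).
    tokens = list(text)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge every occurrence of the pair (left, right).
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Toy example: two learned merges, applied in order.
merges = [("a", "b"), ("ab", "c")]
print(bpe_encode("abcab", merges))  # ['abc', 'ab']
```

Because later merges can only act on tokens produced by earlier ones, replaying the rules in their original order is what makes the encode map well defined.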
Ah, thanks very much, Daniel. Yes, now that you mention it, I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn’t worth expanding on. But yeah, I guess you don’t get a well-defined map until you actually fix how the tokenization happens with another, separate algorithm. I will add this to the list of things to fix/expand on in an edit.