BPE (byte pair encoding) is one of the simplest schemes for producing a large, roughly frequency-weighted set of tokens that compresses arbitrary bytes drawn from a written-language training dataset. That’s about all you need to explain things in ML, typically.
Subword tokenization, the linguistically-guided pre-LLM approach, has a longer history but is comparatively complex, and I don’t think it compresses as well for a given token budget even on fairly normal-looking text.
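For what it’s worth, the core training loop is short enough to show inline. This is a minimal Python sketch of the idea (not any particular library’s implementation): start from individual bytes, repeatedly merge the most frequent adjacent pair, and stop when you hit the merge budget. The `train_bpe` function and the toy input string are just illustrative.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Learn BPE merges from the raw bytes of a training string."""
    # Start from individual bytes; every token is a bytes object.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the current sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        merged, new_tokens, i = best[0] + best[1], [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return merges, tokens

merges, tokens = train_bpe("low lower lowest low low", num_merges=10)
print(tokens)  # frequent byte sequences like b"low" end up as single tokens
```

The learned merge list is the whole model: to tokenize new text, you replay the merges in order, so frequent substrings in the training data become single tokens and rare bytes stay as-is.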
Thanks, it turns out I was not as confused as I thought; I just needed to see the BPE algorithm.