This is the vocab file GPT uses. Don’t stare too long; I have heard the jank is too great for human conception. I might already be infected. Most models don’t bother changing the BPEs, but those that do probably don’t have it any better. (This is machine learning, where your inputs can be almost infinitely awful and nothing will stop working as long as your models are large enough.)
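If you want a taste of the jank without opening the raw file, here is a minimal sketch, assuming the Hugging Face `transformers` package (`tiktoken` would work just as well):

```python
# A minimal sketch of loading GPT-2's BPE vocab and peeking at a few entries.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tok.get_vocab()   # maps token string -> integer id
print(len(vocab))         # 50257 entries

# Sample some high-id entries, where the stranger merges tend to live.
# The odd characters (Ġ, Ã, Ĥ, ...) are GPT-2's byte-to-unicode mapping,
# not corruption -- the vocab really does look like this.
for token, idx in sorted(vocab.items(), key=lambda kv: kv[1])[-10:]:
    print(idx, repr(token))
```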
The true entropy of text is not especially well defined, and it’s hard to tell whether something the model can’t learn regardless of scale is a genuine feature of the distribution or just intractable. I would say models do seem to be capturing the shape of what looks to my mind like the true distribution, and if they do fall short in the limit, it shouldn’t be by much.
I noted that Chinchilla gets 0.667 bits/byte on pile_cc, which is basically the same as bits per character on random internet text. The difference is that pile_cc isn’t pure ASCII, but ASCII makes up a large enough fraction of it that I wouldn’t worry about the details.
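For concreteness, bits/byte is just the model’s per-token cross-entropy spread over the raw bytes of the text (for ASCII, one byte is one character, which is why the two numbers line up). A small sketch of the standard conversion; the numbers below are illustrative placeholders, not figures from the paper:

```python
import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert cross-entropy loss (nats/token) to bits/byte:
    total nats over the text, re-expressed in bits, divided by byte count."""
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# Hypothetical example: ~2.0 nats/token at ~4 bytes/token comes out
# to roughly 0.72 bits/byte.
print(bits_per_byte(2.0, n_tokens=1_000, n_bytes=4_000))
```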