I used the estimate from a document named “What’s In My AI?”, which puts the GPT-2 training dataset at about 15B tokens.
A quick way to estimate the total number of training tokens is to multiply the training dataset size in gigabytes by the tokens-per-byte ratio, which is typically about 0.25 according to the Pile paper; since a gigabyte is roughly a billion bytes, the result comes out in billions of tokens. So 40 GB × 0.25 ≈ 10 billion tokens.
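As a minimal sketch of that back-of-the-envelope calculation (the 0.25 tokens/byte ratio is the assumed average from the Pile paper; the true ratio depends on the tokenizer and data mix):

```python
# Rough estimate of training tokens from dataset size,
# assuming ~0.25 tokens per byte (Pile paper's typical ratio).
TOKENS_PER_BYTE = 0.25  # assumed average; varies by tokenizer and corpus

def estimate_tokens(dataset_gb: float, tokens_per_byte: float = TOKENS_PER_BYTE) -> float:
    """Estimate total training tokens from dataset size in gigabytes."""
    dataset_bytes = dataset_gb * 1e9  # treat 1 GB as 1 billion bytes
    return dataset_bytes * tokens_per_byte

# GPT-2's training set is roughly 40 GB, giving about 10 billion tokens.
print(f"{estimate_tokens(40) / 1e9:.1f}B tokens")  # -> 10.0B tokens
```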
Where does the “15B” for GPT-2’s data come from, here? Epoch’s dataset guesses that it was trained on 3B tokens for 100 epochs: https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/edit#gid=0