I used the figure from a document named What's In My AI?, which estimates that the GPT-2 training dataset contains 15B tokens.
A quick way to estimate the total number of training tokens is to multiply the dataset size in bytes by the tokens-per-byte ratio, which is typically about 0.25 according to the Pile paper. GPT-2's training dataset is roughly 40 GB, so 40 billion bytes × 0.25 ≈ 10 billion tokens.
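As a minimal sketch of that rule of thumb, the helper below (a hypothetical function, not from any cited source) converts a dataset size in gigabytes to an approximate token count, assuming the Pile paper's ~0.25 tokens per byte:

```python
TOKENS_PER_BYTE = 0.25  # rough ratio for English web text, per the Pile paper


def estimate_tokens(dataset_gb: float, tokens_per_byte: float = TOKENS_PER_BYTE) -> float:
    """Estimate total training tokens from a dataset size given in gigabytes."""
    dataset_bytes = dataset_gb * 1e9  # treat 1 GB as 10^9 bytes
    return dataset_bytes * tokens_per_byte


# GPT-2's ~40 GB corpus works out to roughly 10 billion tokens.
print(f"{estimate_tokens(40) / 1e9:.0f}B tokens")
```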