I used the estimate from a document named “What’s In My AI?”, which puts the GPT-2 training dataset at about 15B tokens.
A quick way to estimate the total number of training tokens is to multiply the training dataset size in gigabytes by the tokens-per-byte ratio, which is typically about 0.25 according to the Pile paper; since a gigabyte is roughly a billion bytes, the result comes out in billions of tokens. So 40 GB × 0.25 ≈ 10 billion tokens.
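As a minimal sketch of that back-of-the-envelope calculation (the 0.25 tokens/byte ratio is the assumed average from the Pile paper; the true ratio depends on the tokenizer and data mix):

```python
# Rough estimate of training tokens from dataset size,
# assuming ~0.25 tokens per byte (Pile paper's typical ratio).
TOKENS_PER_BYTE = 0.25  # assumed average; varies by tokenizer and corpus

def estimate_tokens(dataset_gb: float, tokens_per_byte: float = TOKENS_PER_BYTE) -> float:
    """Estimate total training tokens from dataset size in gigabytes."""
    dataset_bytes = dataset_gb * 1e9  # treat 1 GB as 1 billion bytes
    return dataset_bytes * tokens_per_byte

# GPT-2's training set is roughly 40 GB, giving about 10 billion tokens.
print(f"{estimate_tokens(40) / 1e9:.1f}B tokens")  # -> 10.0B tokens
```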
Where does the “15B” for GPT-2’s data come from, here? Epoch’s dataset guesses that it was trained on 3B tokens for 100 epochs: https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/edit#gid=0