In the table of parameters, compute, and tokens, compute/(parameters*tokens) is always 6, except in one case where it’s 0.6, one case where it’s 60, and one case where it’s 2.75. Are you sure this is right?
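Here’s a rough sketch of the check I’m describing, with made-up placeholder rows rather than the actual values from the table:

```python
# Consistency check: if C ≈ 6*D*N holds, then compute / (parameters * tokens)
# should come out to roughly 6 for every row. The rows below are illustrative
# placeholders, not the real numbers from the table.
rows = [
    # (parameters N, training tokens D, training compute C in FLOPs)
    (1e9,   20e9,  1.2e20),   # ratio 6.0
    (10e9,  200e9, 1.2e22),   # ratio 6.0
    (100e9, 2e12,  1.2e23),   # ratio 0.6 -- the kind of outlier I mean
]

for n_params, n_tokens, compute in rows:
    ratio = compute / (n_params * n_tokens)
    note = "" if abs(ratio - 6) < 1 else "  <-- inconsistent with C ≈ 6DN"
    print(f"N={n_params:.0e}  D={n_tokens:.0e}  C/(N*D) = {ratio:g}{note}")
```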
Thanks for spotting this. I noticed that I originally used the formula C=6DN when it should really be C≈6DN, since that is how it’s written in the OpenAI paper Scaling Laws for Neural Language Models (2020). I updated the equation.
The amount of compute used during training is proportional to the number of parameters N and the number of training tokens D: C ∝ DN → C ≈ kDN → C ≈ 6DN.
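For example (with numbers chosen purely for illustration, not taken from the table): a model with N = 10⁹ parameters trained on D = 2×10¹⁰ tokens needs roughly C ≈ 6 × 10⁹ × 2×10¹⁰ ≈ 1.2×10²⁰ FLOPs.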
Where this formula and the table conflict, I think the table should be used, since it’s based on empirical results, whereas the C≈6DN formula is more of a rule of thumb.
My point wasn’t that the equation didn’t hold perfectly, but that the discrepancies are very suspicious. Two of the three discrepancies were off by exactly one order of magnitude, which makes me fairly confident they are the result of a typo. (I’m not sure what’s going on with the other discrepancy.)
You were right. I forgot the 1B-parameter model row, so the table was shifted by an order of magnitude. I updated the table, so it should be correct now. Thanks for spotting the mistake.