In the table of parameters, compute, and tokens, compute/(parameters*tokens) is always 6, except in one case where it’s 0.6, one case where it’s 60, and one case where it’s 2.75. Are you sure this is right?
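Here’s a rough sketch of the check I’m describing, with made-up placeholder rows rather than the actual values from the table:

```python
# Consistency check: if C ≈ 6*D*N holds, then compute / (parameters * tokens)
# should come out to roughly 6 for every row. The rows below are illustrative
# placeholders, not the real numbers from the table.
rows = [
    # (parameters N, training tokens D, training compute C in FLOPs)
    (1e9,   20e9,  1.2e20),   # ratio 6.0
    (10e9,  200e9, 1.2e22),   # ratio 6.0
    (100e9, 2e12,  1.2e23),   # ratio 0.6 -- the kind of outlier I mean
]

for n_params, n_tokens, compute in rows:
    ratio = compute / (n_params * n_tokens)
    note = "" if abs(ratio - 6) < 1 else "  <-- inconsistent with C ≈ 6DN"
    print(f"N={n_params:.0e}  D={n_tokens:.0e}  C/(N*D) = {ratio:g}{note}")
```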
Thanks for spotting this. I noticed that I originally used the formula C=6DN when it should really be C≈6DN, since that is how it’s written in the OpenAI paper Scaling Laws for Neural Language Models (2020). I updated the equation.
The amount of compute used during training is proportional to the number of parameters N and the number of training tokens D: C ∝ DN → C ≈ kDN → C ≈ 6DN.
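For example (with numbers chosen purely for illustration, not taken from the table): a model with N = 10⁹ parameters trained on D = 2×10¹⁰ tokens needs roughly C ≈ 6 × 10⁹ × 2×10¹⁰ ≈ 1.2×10²⁰ FLOPs.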
Where this formula and the table conflict, I think the table should be used, since it’s based on empirical results, whereas the C≈6DN formula is more of a rule of thumb.
My point wasn’t that the equation didn’t hold perfectly, but that the discrepancies are very suspicious. Two of the three discrepancies were off by exactly one order of magnitude, which makes me fairly confident they are the result of a typo. (I’m not sure what’s going on with the other discrepancy.)
You were right. I forgot the 1B-parameter model row, so the table was shifted by an order of magnitude. I updated the table, so it should be correct now. Thanks for spotting the mistake.