They do discuss this a little bit in that scaling paper, in Appendix D.6. (edit: actually Appendix D.5)
At least in their experimental setup, they find that the first 8 tokens are predicted better by a model with only 8 tokens in its window than by one with 1024 tokens, if the two have equally many parameters. And later tokens are harder to predict, so they require more parameters if you want to reach a given loss threshold.
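To make the comparison concrete, here's a rough sketch (my own, not from the paper; the tensor names and shapes are assumptions) of what "loss at token index" means in those per-position plots: cross-entropy averaged over the batch but kept separate per position, so you can plot it against token index and compare, say, an 8-token-window model against a 1024-token-window model on the first 8 positions.

```python
import torch
import torch.nn.functional as F

def per_position_loss(logits, targets):
    """Mean cross-entropy at each context position.

    logits:  [batch, seq_len, vocab] -- the model's next-token predictions
    targets: [batch, seq_len]        -- the tokens actually observed
    Returns a [seq_len] tensor: averaged over the batch but NOT over
    positions, so loss can be plotted as a function of token index.
    """
    batch, seq_len, vocab = logits.shape
    loss = F.cross_entropy(
        logits.reshape(-1, vocab),   # [batch * seq_len, vocab]
        targets.reshape(-1),         # [batch * seq_len]
        reduction="none",
    )
    return loss.reshape(batch, seq_len).mean(dim=0)
```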
I’ll have to think more about this and what it might mean for their other scaling laws… at the very least, it’s an effect which their analysis treats as approximately zero, and math/physics models with such approximations often break down in a subset of cases.
While you’re here and chatting about D.5 (I assume you meant D.5), another tiny thing that confuses me: Figure 21. Am I right in reading the bottom two lines as ‘seeing 255 tokens and predicting the 256th is exactly as difficult as seeing 1023 tokens and predicting the 1024th’?
edit: On another look, I realise Fig 20 shows things much more clearly; never mind, things continue to get easier with token index.