10x seems reasonable on its face, but honestly I have no idea. We haven’t really dealt with scales and feature learners like this before. I assume a big part of what the model is doing is learning good representations that let it learn more/better from each example as training goes on. Given that, I can imagine arguments either way. On one hand, good representations could mean the model figures out on its own what’s important, so maybe data cleaning doesn’t matter much. On the other hand, noisy data (say, data with lots of irreducible entropy, though that’s not necessarily what “garbage text” looks like, often it’s the opposite, and it depends on how you filter in practice) could take up a disproportionately large share of model capacity and training signal as the representations of “good” (i.e. compressible) data get better, adding a bunch of noise to training and slowing it down. These are just intuitive guesses, though. It seems like an empirical question and probably depends a lot on the details.