GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference. Glad that Geohot said it out loud. Though, at this point, GPT-4 is probably distilled to be more efficient.
If this is true, does it imply that scaling has hit limits?
My takeaway is that there is more straightforward scaling left than I expected. If it were instead a single 600B Chinchilla-scaled model, that would already be close (in OOMs) to the feasible amount of good training data (Chinchilla-optimal training calls for roughly 20 tokens per parameter, so on the order of 12T tokens for 600B), so you'd barely get a GPT-5 by scaling past that point.
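A rough back-of-envelope sketch of that OOM claim, assuming the ~20 tokens-per-parameter Chinchilla heuristic and taking the rumored 8 x 220B figures at face value:

```python
# Back-of-envelope Chinchilla arithmetic; a rough heuristic, not a precise claim.
TOKENS_PER_PARAM = 20  # Chinchilla-optimal rule of thumb: ~20 training tokens per parameter

def chinchilla_tokens(params: float) -> float:
    """Approximate compute-optimal training tokens for a dense model with `params` parameters."""
    return TOKENS_PER_PARAM * params

dense_600b = 600e9
print(f"600B dense model: ~{chinchilla_tokens(dense_600b) / 1e12:.0f}T tokens")  # ~12T tokens

# Rumored GPT-4 configuration: 8 experts x 220B parameters (total, not all active per token).
moe_total = 8 * 220e9
print(f"8 x 220B experts: ~{moe_total / 1e12:.2f}T total parameters")  # ~1.76T params
```

The point of the comparison: ~12T tokens of good text is already near what's plausibly available, so a dense Chinchilla-scaled 600B model would leave little headroom, whereas the rumored mixture-of-experts route spends parameters rather than (only) data.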
Instead, there is probably still quite a bit of training data to spare (to choose from); they won't be running out of it even if they fail to crack useful generation of synthetic pre-training data in the immediate future (an effort that is only just getting started). The other straightforward path to scaling is multimodality, but with non-textual data the models could start getting smarter more slowly (more expensively) than they would with a counterfactually sufficient supply of natural text data.
OTOH, investment in scaling that pays for itself is measured in the marginal fractions of the world economy that get automated, so this too could be sustained for some time yet, even for as long as it takes if Moore's law is not repealed (which it really should be, for the unlikely case that doing so could still help).