While current text datasets are finite, and expanding them with high-quality human-generated text would be expensive, I’m afraid that’s not going to be a blocker.
Multimodal training already completely bypasses text-only limitations. Beyond just extracting text tokens from YouTube, the video and audio themselves could be used as training data, and their informational richness relative to text seems to be very high.
Further, as Gato demonstrates, there’s nothing stopping one model from spanning hundreds of distinct tasks, and many of those tasks can come from effectively infinite data fountains, like simulations. Learning rigid body physics in isolation isn’t going to teach a model English, but as one of a few thousand other tasks, it could push the internal model toward something more general. (There’s a paper whose name I have unfortunately forgotten that created a set of permuted tasks large enough that the model could not actually learn each individual task, and instead had to infer what the task was from the context window. It worked. Despite these being toy task permutations, I suspect something like this generalizes at sufficient scale.)
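To make the permuted-task idea concrete, here is a minimal sketch of that kind of setup, not taken from the paper in question (whose name the comment above doesn't recall) but assuming a simple label-permutation scheme: every "task" is the same toy classification problem under a different random permutation of the labels, and there are far too many permutations for any single mapping to be memorized, so the only winning strategy is to infer the current permutation from the examples in the context window. All names and parameters here (`NUM_TASKS`, `make_prompt`, etc.) are illustrative.

```python
# Hypothetical sketch of a permuted-task data generator that forces in-context learning.
import random

NUM_CLASSES = 8
NUM_TASKS = 100_000          # far more permutations than a model could plausibly memorize
EXAMPLES_PER_PROMPT = 4      # in-context demonstrations preceding the query

def make_task(seed: int) -> list[int]:
    """Each 'task' is the same classification problem under a random label permutation."""
    rng = random.Random(seed)
    permutation = list(range(NUM_CLASSES))
    rng.shuffle(permutation)
    return permutation

def make_prompt(seed: int):
    """Build one training sequence: a few (input, permuted-label) pairs, then a query.
    The only way to answer the query correctly is to infer the permutation in-context."""
    rng = random.Random(seed)
    permutation = make_task(rng.randrange(NUM_TASKS))
    demos = []
    for _ in range(EXAMPLES_PER_PROMPT):
        x = rng.randrange(NUM_CLASSES)
        demos.append((x, permutation[x]))
    query = rng.randrange(NUM_CLASSES)
    return demos, query, permutation[query]

if __name__ == "__main__":
    demos, query, answer = make_prompt(seed=0)
    print("in-context examples:", demos)
    print("query:", query, "-> expected label:", answer)
```

The point of the construction is that per-task memorization stops paying off once the task count is large enough, so learning the *meta*-skill of reading the task out of the context becomes the cheaper solution.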
And it appears that sufficiently capable models can refine themselves in various ways. At the moment, that refinement doesn’t produce a runaway divergence in capability, but there’s no guarantee it stays that way as models improve.
Very insightful, thanks for the clarification, as dooming as it is.