We’re not running out of data to train on, just text.
Why didn't I need a trillion language examples to speak (debatably) intelligently? I'd suspect the reason is a combination of inherited training from my ancestors and, more importantly, the fact that language output is only the surface layer.
In order for language models to get much better, I suspect they need to train on more than just language. It's difficult to talk intelligently about complex subjects if you've only ever read about them, especially if you have no eyes, ears, or any other sense data. The best language models are still missing crucial context/info that could be gained through video, audio, and robotic I/O.
Combined with this post, this would also suggest our hardware can already train more parameters than we need for much more intelligent models, if we can get that data from non-text sources.