I respect Sutskever a lot, but if he believed that he could get an equivalent world model by spending an equivalent amount of compute on next-token prediction over any other set of real-world data samples, why would they go to such lengths to obtain human-generated text specifically for training? They might as well just make lots of random recordings (e.g., video, audio, radio signals) and pump them all into the model. In principle that could probably work, but it’s very inefficient.
Human language is a very high-density encoding of world models, so by training on human language, models get much of their world model “for free”, because humanity has already done a lot of pre-work by sampling reality in a wide variety of ways and compressing it into the structure of language. However, our use of language still doesn’t capture all of reality exactly, and I would argue it’s not even close. (Saying otherwise is equivalent to saying we’ve already discovered almost all possible capabilities, which would entail that AI actually has a hard cap at roughly human ability.)
In order to expand its world model beyond human ability, AI has to sample reality itself, which is much less sample-efficient than sampling human behavior, hence the “soft cap”.