What happens when OpenAI simply expands this method of token prediction to train with every kind of correlated multimedia on the internet: audio, video, text, images, semantic web ontologies, and scientific data? If they also increase the buffer size and token complexity, how close does this get us to AGI?
While other media would undoubtedly improve the model’s understanding of concepts that are hard to express through text, I’ve never bought the idea that it would do much for AGI. Text has more than enough in it to capture intelligent thought; it is the relations and structure that matter, above all else. If this weren’t true, one wouldn’t expect to find competent deafblind people, yet they exist. Their successes come despite an evolutionary history with practically no surviving deafblind ancestors! Clearly the modules that make humans intelligent, in a way that other animals and things are not, do not depend on multisensory data.
A few points. First, I’ve heard several AI researchers say that GPT-3 is already close to the limit of all high-quality human-generated text data. While the amount of text on the internet will continue to grow, it might not grow fast enough for major continued improvement. Thus additional media might be necessary as training input.
Second, deafblind people still have multiple senses that allow them to build 3D sensory-motor models of reality (touch, smell, taste, proprioception, the vestibular sense, sound vibrations). Correlations among these senses give rise to an understanding of causality. Moreover, human brains might have evolved innate structures for things like causality, agency, and objecthood, which don’t have to be learned.
Third, as DALL-E illustrates, intelligence is not just about acquiring knowledge; it is also about expressing that knowledge in a medium. It is hard to see how an AI trained only on text could paint a picture or sing a song.
I expect getting a dataset an order of magnitude larger than The Pile without significantly compromising on quality will be hard, but not impractical. Two orders of magnitude (~100 TB) would be extremely difficult, if feasible at all. But it’s not clear that this matters; per the Scaling Laws paper, dataset requirements grow more slowly than model size, and a 10 TB dataset would already be past the compute-data intersection point it describes.
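To make "grows more slowly" concrete, here is a back-of-envelope sketch. It assumes the rough sublinear D ∝ N^0.74 overfitting-avoidance relation reported in the Kaplan et al. scaling-laws paper, plus a GPT-3-ish baseline of roughly 175B parameters trained on roughly 300B tokens; the function name, exponent, and baseline figures are my illustrative assumptions, not exact numbers to lean on.

```python
# Illustrative only: how much data a larger model "wants", assuming the rough
# D ∝ N^0.74 relation from the scaling-laws paper and an approximate GPT-3-like
# baseline of ~175B parameters trained on ~300B tokens.

def required_tokens(n_params, n_base=175e9, d_base=300e9, exponent=0.74):
    """Scale the baseline dataset size sublinearly with model size."""
    return d_base * (n_params / n_base) ** exponent

for scale in (1, 10, 100):
    tokens = required_tokens(175e9 * scale)
    print(f"{scale:>3}x model size -> ~{tokens:.1e} tokens (~{tokens / 300e9:.0f}x the data)")
```

Under those assumptions a 100x larger model wants only ~30x more data, which is why raw text volume looks less like the binding constraint than it first appears.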
Note also that 10 TB of text is an exorbitant amount. Even if there were a model that would hit AGI with, say, a PB of text, but not with 10 TB of text, it would probably also hit AGI with 10 TB of text plus some fairly natural adjustments to its training regime to inhibit overfitting. I wouldn’t argue this all the way down to human levels of data, since the human brain has much more embedded structure than we assume for ANNs, but certainly huge models like GPT-3 start to learn new concepts in only a handful of updates, and I expect that trend of greater learning efficiency to continue.
I’m also skeptical that images, video, and the like would substantially change the picture. Images are very information-sparse: consider how much you can learn from 1 MB of text versus 1 MB of pixels.
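As a crude illustration of that comparison, here is a byte-count sketch; the ~5 bytes per English word and 3 bytes per uncompressed RGB pixel are rough figures I’m plugging in, not measurements.

```python
# Crude information-density comparison; bytes-per-word and bytes-per-pixel
# are rough illustrative assumptions.
MB = 1_000_000
BYTES_PER_WORD = 5      # ~4.7 letters per English word plus a space
BYTES_PER_PIXEL = 3     # uncompressed 8-bit RGB

words = MB / BYTES_PER_WORD     # ~200,000 words: on the order of two novels
pixels = MB / BYTES_PER_PIXEL   # ~333,000 pixels: a single ~580x580 image

print(f"1 MB of text   ~ {words:,.0f} words")
print(f"1 MB of pixels ~ {pixels:,.0f} raw RGB pixels (one smallish image)")
```

Compression shifts the exact numbers, but not the qualitative gap between a couple of novels’ worth of propositions and a single photograph.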
Correlations among these senses give rise to an understanding of causality. Moreover, human brains might have evolved innate structures for things like causality, agency, and objecthood, which don’t have to be learned.
Correlation is not causation ;). I think it’s plausible that agenthood would help progress towards some of those ideas, but that doesn’t much argue for multiple distinct senses. You can find mere correlations just fine with only one.
It’s true that even a deafblind person will have mental structures that evolved for sight and hearing, but that’s not much of an argument that those structures are needed for intelligence, and given the evidence (the lack of mental impairment in deafblind people), a strong argument seems necessary.
For sure, I’ll accept that you’ll want to train multimodal agents anyway, to round out their capabilities. A deafblind person might still be intellectually capable, but that doesn’t mean they can paint.