The only point of this token prediction game is as some sort of rough proxy to estimate human brain token prediction ability. I think you may now agree it’s horrible at that, as it would require unknown but significant human training time to unlock the linguistic prediction ability the cortex already has. Humans’ poor zero-shot ability at this specific motor-visual game (which is the only thing this post tested!) does not imply the human brain doesn’t have powerful token prediction ability (as I predict you already agree, or will shortly).
I highly doubt anybody reading this is actually interested in the claim that AI is superhuman at this specific weird token probability game, and that alone. Are you? The key subtext here, all of the interest, is in the fact that this is a core generic proxy task: from merely learning token prediction, a huge number of actually relevant downstream tasks emerge nearly automatically.
There is nothing surprising about this: we’ve known for a long, long time (since AIXI days) that purely learning to predict the sensory stream is in fact the only universal, necessary and sufficient learning task for superintelligence!
AlphaGo vs. humans at Go is very different in several key respects. First, (at least some) humans actually have nontrivial training (years) in the game itself, so we can more directly compare along more of the training curve. Second, Go is not a key component subtask of many economically relevant components of human intelligence in the way that token prediction is a core training proxy for, and subtask of, core linguistic abilities.
“However, it’s possible though not by any means proven, that deep within the human brain there are subnetworks that are as good or better than AIs at predicting text,”
Actually, this is basically just a known fact from neuroscience and general AI knowledge at this point, about as proven as such things can be. Much of the brain (and nearly all sensorimotor cortex) learns through unsupervised sensory prediction. I haven’t even looked yet, but I’m also near certain there are neuroscience papers that probe this for linguistic token prediction ability. I’m reasonably familiar with the research on how the vision system works, and it is all essentially transformer-style unsupervised learning (UL), predictive training on the pixel stream; it can’t possibly be different for the linguistic centers, as there are no hard-coded linguistic centers, just generic cortex.
So from this we already know a way to estimate human-equivalent perplexity: measure human ability on a battery of actually important linguistic tasks (writing, reading, math, etc.), then train a predictor that maps downstream benchmark performance to perplexity across LMs, and read off the perplexity an LM with human-level benchmark performance would have. The difficulty here, if anything, is that even the best LMs (last I checked) haven’t learned all of the emergent downstream tasks yet, so you’d have to bias the benchmark toward the tasks current LMs can handle.
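To make that concrete, here is a minimal sketch in Python of the kind of estimator I have in mind. Everything in it is an illustrative assumption: the (benchmark accuracy, perplexity) pairs are made up, and the log-linear fit is just the simplest plausible functional form, not a claim about the true scaling relationship.

```python
# Sketch: estimate "human-equivalent perplexity" by fitting the relationship
# between downstream benchmark performance and perplexity across several LMs,
# then reading off the perplexity corresponding to human-level benchmark scores.
# All numbers are made-up placeholders, not real measurements.
import numpy as np

# Hypothetical (average benchmark accuracy, test perplexity) pairs for a
# family of LMs of increasing scale.
benchmark_acc = np.array([0.35, 0.45, 0.55, 0.65, 0.72])
perplexity    = np.array([35.0, 24.0, 17.0, 12.0, 9.5])

# Fit log-perplexity as a linear function of benchmark accuracy.
slope, intercept = np.polyfit(benchmark_acc, np.log(perplexity), deg=1)

def equivalent_perplexity(acc: float) -> float:
    """Perplexity the fitted trend assigns to a given benchmark accuracy."""
    return float(np.exp(slope * acc + intercept))

# Hypothetical human score on the same battery of linguistic tasks.
# Note this extrapolates past the fitted models, which is exactly the
# difficulty mentioned above.
human_acc = 0.85
print(f"Estimated human-equivalent perplexity: {equivalent_perplexity(human_acc):.1f}")
```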
Whoa, hold up. It’s one thing to say that the literature proves that the human brain is doing text prediction. It’s another thing entirely to say that it’s doing it better than GPT-3. What’s the argument for that claim, exactly? I don’t follow the reasoning you give above. It sounds like you are saying something like this:
“Both the brain and language models work the same way: primarily they just predict stuff, but then as a result of that they develop downstream abilities like writing, answering questions, doing math, etc. So since humans are better than GPT-3 at math etc., they must also be better than GPT-3 at predicting text. QED.”
Basically yes.
There are some unstated caveats, however. Humans have roughly several orders of magnitude greater data efficiency on the downstream tasks, and part of that involves active sampling: we don’t have time to read the entire internet, but that doesn’t really matter, because we can learn efficiently from a well-chosen subset of that data. Current LMs just naively read and learn to predict everything, even though that is rather obviously sub-optimal. So humans aren’t training on exactly the same proxy task, but on a (better) closely related proxy task.
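As a toy illustration of what I mean by active sampling (my own sketch of one possible mechanism, not how any current LM pipeline actually works): score candidate documents by how badly the current model predicts them, and spend the training budget on the most informative ones. The `model_loss` function here is a hypothetical stand-in for whatever per-example loss the training loop already computes.

```python
# Toy sketch of loss-based active sampling over a pool of candidate documents.
from typing import Callable, List

def select_batch(pool: List[str],
                 model_loss: Callable[[str], float],
                 budget: int) -> List[str]:
    """Pick the `budget` documents the current model predicts worst,
    as a crude proxy for value-of-information."""
    return sorted(pool, key=model_loss, reverse=True)[:budget]

# Usage with a toy loss: prefer material the model hasn't already mastered.
already_learned = {"the cat sat on the mat"}
toy_loss = lambda doc: 0.1 if doc in already_learned else 1.0
pool = [
    "the cat sat on the mat",
    "a proof of the prime number theorem",
    "notes on transformer scaling laws",
]
print(select_batch(pool, toy_loss, budget=2))
```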
How do you rule out the possibility that:
1. some aspects of language prediction are irrelevant for our lives / downstream tasks (e.g. different people would describe the same thing using subtly different word choice and order);
2. other aspects of language prediction are very important for our lives / downstream tasks (the gestalt of what the person is trying to communicate, the person’s mood, etc.);
3. an adult human brain is much better than GPT-3 at (2), but much worse than GPT-3 at (1);
4. the perplexity metric puts a lot of weight on (1) (see the toy calculation after this list);
5. and thus there are no circuits anywhere in the human brain that can outperform GPT-3 in perplexity?
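To spell out point 4, here is a toy calculation (purely illustrative numbers) of how heavily perplexity rewards predicting the exact word used, even when two predictors have the gist equally right:

```python
# Toy illustration of how perplexity weights exact word choice (point 1)
# versus getting the gist right (point 2). Numbers are illustrative only.
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the
    tokens that actually occurred."""
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# Suppose at each of 10 positions there are 5 synonyms that would all convey
# the intended meaning equally well, and the author picked one of them.
n_positions, n_synonyms = 10, 5

# Predictor A "understands" the sentence but is indifferent among synonyms:
# it spreads probability uniformly over the 5 acceptable words.
predictor_a = [1.0 / n_synonyms] * n_positions

# Predictor B additionally models the author's exact word-choice habits and
# puts 0.9 on the token actually used.
predictor_b = [0.9] * n_positions

print(f"Predictor A perplexity: {perplexity(predictor_a):.2f}")  # 5.00
print(f"Predictor B perplexity: {perplexity(predictor_b):.2f}")  # ~1.11
```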
That would be my expectation. I think human learning has mechanisms that make it sensitive to value-of-information, even at a low level.
If you have only tiny model capacity and abundant reward feedback, purely supervised learning wins, as in the first early successes in DL like AlexNet and DeepMind’s early agents. This is expected: when each connection/param is super precious, you can’t ‘waste’ any capacity by investing it in modeling bits of the world that don’t have immediate payoff.
But in the real world sensory info vastly dwarfs reward info, so with increasing model capacity UL wins, as in the more modern successes of transformers trained with UL. The brain is very far along in that direction; it has essentially unlimited model capacity in comparison.
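For a sense of scale, a crude back-of-envelope comparison; every figure here is an order-of-magnitude assumption on my part, not a measurement:

```python
# Rough comparison of the information rate of the visual stream alone
# versus a sparse scalar reward signal. All figures are coarse assumptions.
optic_nerve_fibers = 1e6      # ~1 million retinal ganglion cell axons per eye (rough)
bits_per_fiber_per_s = 10     # assumed effective rate per fiber (rough)
sensory_bits_per_s = optic_nerve_fibers * bits_per_fiber_per_s   # ~1e7 bits/s

reward_bits_per_s = 1         # a sparse scalar reward: order of a bit per second

print(f"Sensory stream ~{sensory_bits_per_s / reward_bits_per_s:.0e}x "
      f"more bits/s of training signal than the reward channel")
```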
Re (1)/(2): the issue is that the system can’t easily predict, up front, which aspects will turn out to be important much later for downstream tasks.
All that being said, I somewhat agree, in the sense that perplexity isn’t necessarily the best measure (the best measure being whatever best predicts performance on all the downstream tasks).
OK, cool. Well, I don’t buy that argument. There are other ways to do math besides being really really ridiculously good at internet text prediction. Humans are better at math than GPT-3 but probably that’s because they are doing it in a different way than merely as a side-effect of being good at text prediction.
If it were just math, then OK, sure. But GPT-3 and related LMs can learn a wide variety of linguistic skills at certain levels of compute/data scale, and I was explicitly referring to a wide (linguistic and related) skill benchmark, with math being a stand-in example for linguistically related/adjacent skills.
And btw, from what I understand GPT-3 learns math from having math problems in its training corpus, so it’s not even a great example of a “side-effect of being good at text prediction”.