Expressed differently, my point is that this isn’t even a comparison of two systems on the same task! The LM is directly connected to the text stream and is trained as a system with that dataflow. The human is performing a completely different, much more complex task involving vision and motor subtasks in addition to the linguistic subtask, and is then tested on zero/few-shot performance on this new integrated task with little to no transfer training.
If you want to make any valid performance-comparison claims, you need to compare systems on the exact same full-system task. To do that you’d need to take the ANN LM and hook it up to a (virtual or real) robot whose input is the same rendered video display the human receives, and whose output is robotic motor commands that manipulate a mouse to play the software version of this text prediction game. Do you really think the ANN LM would beat humans in this true apples-to-apples comparison?
I feel like the implied conclusion you’re arguing against here is something like “LMs are more efficient per sample than the human brain at predicting language”,
What? Not only is that not what I’m arguing, the opposite is true! Humans are obviously more sample-efficient than current ANN LMs—by a few OOM (humans exceed LM capability after less than 1B tokens equivalent, roughly speaking). Human learning is far more sample-efficient on all the actually ecologically/economically important downstream linguistic tasks, so it stands to reason that the linguistic brain regions are probably also more sample-efficient at sequence prediction—although that remains untested. Again, this technique does not actually measure the true linguistic prediction ability of the cortex.
I think the conclusion is exactly as stated—in the real world, LMs outperform humans.
At what? Again, in this example the humans and the LM are not even remotely performing the same task. The human task in this example is an enormously more complex computational few-shot learning task. We can theorize about more complex agentic systems that could perform the exact same robotic/visual task here (perhaps Gato?), but that isn’t what this post discusses.
This isn’t a useful task—there is no inherent value in next-token prediction. It’s just a useful proxy training measure. Human brains are likely using the same unsupervised proxy internally, but actually measuring that directly would involve some complex and invasive neural interfaces (perhaps some future version of Neuralink).
But that’s unnecessary, because instead we can roughly estimate human brain performance via comparison on all the various actually economically/ecologically useful downstream tasks such as reading comprehension, math/linguistic problems, QA tasks, story writing, etc. And the general conclusion is that the human brain is very roughly an OOM or two more ‘compute’ efficient and several OOM more data efficient than today’s LMs. (For now.)
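To make the data-efficiency claim concrete, here is a back-of-envelope sketch. The ~1B-token human figure comes from the estimate above; the LM corpus size is an assumed round number for illustration, not a measured value:

```python
import math

# Assumed figures, for illustration only (not measured values):
human_tokens = 1e9   # "less than 1B tokens equivalent" for a human, per above
lm_tokens = 3e11     # rough training-corpus size for a large LM (assumption)

# Data-efficiency gap in orders of magnitude (OOM)
data_gap_oom = math.log10(lm_tokens / human_tokens)
print(f"data-efficiency gap: ~{data_gap_oom:.1f} OOM")  # ~2.5 OOM
```

Under these assumptions the gap comes out to roughly two and a half orders of magnitude, consistent with the “several OOM more data efficient” claim.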
I think I should have emphasised the word “against” in the sentence of mine you quoted:
I feel like the implied conclusion you’re arguing against here is something like “LMs are more efficient per sample than the human brain at predicting language”,
You replied with “What? Not only is that not what I’m arguing, the opposite is true!”, which was precisely what I was saying. The conclusion you’re arguing against is that LMs are more sample-efficient than humans. This would require you to take the opposite stance—that humans are more sample-efficient than LMs. This is exactly the stance I believed you were taking. I then went on to say that the text did not rely on this assumption, and therefore your argument, while correct, did not affect the post’s conclusions.
I agree with you that humans are much more sample-efficient than LMs. I have no compute comparison for human brains and ML models, so I’ll take your word on compute efficiency. And I agree that the task the human is doing is more complicated. Humans would dominate modern ML systems if you limited those systems to the data, compute, and sensory inputs that humans get.
I think our major crux is that I don’t see this as particularly important. It’s not a fair head-to-head comparison, but it’s never going to be in the real world. What I personally care about is what these machines are capable of doing, not how efficient they are when doing it or what sensory requirements they have to bypass. If a machine can do a task better than a human can, it doesn’t matter if it’s a fair comparison, provided we can replicate and/or scale this machine to perform the task in the real world. Efficiency matters since it determines cost and speed, but then the relevant factor is “Is this sufficiently fast / cost-effective to use in the real world”, not “How does efficiency compare to the human brain”.
Put it this way: calculators don’t have to use neurons to perform mathematics, and you can put numbers into them directly instead of them having to hear or read them. So it’s not really a valid comparison to say calculators are directly superhuman at arithmetic. And yet, that doesn’t stop us at all from using calculators to perform such calculations much faster and more accurately than any human, because we don’t actually need to restrict computers to human senses. So, why does it matter that a calculator would lose to a human in a “valid” arithmetic contest? How is that more important than what the calculator can actually do under normal calculator-use conditions?
Humans would dominate modern ML systems if you limited those systems to the data, compute, and sensory inputs that humans get.
I think our major crux is that I don’t see this as particularly important.
Ahh, I apparently missed that key word “against”. No, I don’t strongly disagree there; the crux is perhaps more fundamental.
Earlier you said:
I think the conclusion is exactly as stated—in the real world, LMs outperform humans.
At what?
The post title says “Language models seem to be much better than humans at next-token prediction”. I claim this is most likely false, and you certainly can’t learn much of anything about true relative human vs. LM next-token prediction ability by evaluating the LM on task A while evaluating the human on the vastly more complex and completely different task B.
Furthermore, I claim that humans are actually probably somewhere between on par with and much better than current LMs at next-token prediction. Next-token prediction is only useful as an unsupervised proxy training task which then enables performance on the actually relevant downstream linguistic tasks (reading, writing, math, etc.). Actual human performance on those downstream tasks currently dominates LM performance, even when LMs use far more compute and essentially all the world’s data—still not quite enough to beat humans (yet!).
Once again, to correctly and directly evaluate human next-token prediction ability would probably require a very sophisticated neural interface for high-frequency whole-brain recording. Lacking that, our best way to estimate true human next-token prediction perplexity is by measuring human performance on the various downstream tasks which emerge from that foundational prediction ability.
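For concreteness, the perplexity being estimated here is just the exponentiated average negative log-probability a predictor assigns to the tokens that actually occur. A minimal sketch, with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    assigned to the tokens that actually occurred next."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up probabilities a predictor assigned to a four-token stretch of text:
probs = [0.2, 0.05, 0.6, 0.1]
print(round(perplexity(probs), 2))  # → 6.39
```

Lower is better: a perplexity of 6.39 means the predictor was, on average, about as uncertain as a uniform choice among ~6 candidate tokens.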
What it feels like to me here is that we’re both arguing different sides, and yet if you asked both of us about any empirical fact we expect to see using current or near-future technology, such as “Would a human achieve a better or worse score on a next-token prediction task under X conditions”, we would both agree with each other.
Thus, it feels to me, upon reflection, that our argument isn’t particularly important, unless we could identify an actual prediction that matters where we would each differ. What would it imply to say that humans are better at next-token prediction than this test suggests, compared to the world where that’s not the case and the human subconscious is no better at word prediction than our conscious mind has access to?
Again, you said earlier that “in the real world, LMs outperform humans.”
We disagree on that, and I’m still not sure why, as you haven’t addressed my crux or related claims. So if, on reflection, you agree that LMs don’t actually outperform humans (yet), then sure, we agree. Do you believe that LMs generally outperform humans at important real-world linguistic skills such as reading, writing, math, etc.?
I’m not even sure what you mean by ‘next-token prediction task’—do you mean the language modelling game described in this post? Again, one of my core points is that LMs score essentially 0 on that game, as they can’t even play it. I suspect you could probably train Gato or some similar multi-modal agent to play it, but that wasn’t tested here.
What would it imply to say that humans are better at next-token prediction than this test suggests, compared to the world where that’s not the case and the human subconscious is no better at word prediction than our conscious mind has access to?
A great deal. If this game was an accurate test of human brain next-token prediction ability, then either humans wouldn’t have the language skills we are using now, or much of modern neuroscience & DL would be wrong as the brain would be somehow learning complex linguistic skills without bootstrapping from unsupervised sequence prediction.
Since LMs actually score zero on this language modelling game, it also ‘predicts’ that they have zero next-token prediction ability.
I think the reason we appear to disagree here is that we’re both using different definitions of “outperform”.
My understanding is that Jacob!outperform means to win in a contest where all other variables are the same—thus, you can’t say that LMs outperform humans when they don’t need to apply the visual and motor skills that humans do. The interfaces aren’t the same, so the contest is not fair. If I score higher than you in a tenpin bowling match where I have the safety rails up and you don’t, we can’t say I’ve outperformed you at tenpin bowling.
Jay!outperform means to do better on a metric (such as “How often can you select the next word”) where each side is using an interface suited for them that would correlate with the ability to perform the task on a wide range of inputs. That is to say—it’s fine for the computer to cheat, so long as that cheating doesn’t prevent it from completing the task out of distribution, or the distribution is wide enough to handle anything a user is likely to want from the program in a commercial setting. If the AI was only trained on a small corpus and learned to simply memorise the entire corpus, that wouldn’t count as outperforming because the AI would fall apart if we tried to use it on any other text. But since the task we want to check is text prediction, not visual input or robotics, it doesn’t matter that the AI doesn’t have to see the words.
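To make the “each side uses its own interface, only the metric is shared” idea concrete, here is a minimal sketch of how such a top-1 next-word score could be computed; the data and the stand-in predictor are entirely made up:

```python
def top1_accuracy(predict, examples):
    """Fraction of (context, next_word) pairs where the predictor's
    single guess matches the word that actually came next."""
    hits = sum(1 for context, nxt in examples if predict(context) == nxt)
    return hits / len(examples)

# Toy data; in the real game these would be passages from a held-out corpus.
examples = [
    ("the cat sat on the", "mat"),
    ("to be or not to", "be"),
    ("once upon a", "time"),
]

# Stand-in predictor (hypothetical): always guesses "mat".
def always_mat(context):
    return "mat"

print(top1_accuracy(always_mat, examples))  # 1 of 3 correct
```

The point of the sketch: `predict` can be anything—an LM fed raw text, or a human reading a rendered screen and clicking a mouse—and the scoring function doesn’t care, which is exactly the Jay!outperform notion of comparing on the metric alone.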
Both these definitions have their place. Saying “AI can Jacob!outperform humans at this task” would mean the AI was closer to AGI than if it can Jay!outperform humans at a task. I can also see how Jacob!outperforming would be possible for a truly general intelligence. However, saying “AI can Jay!outperform humans at this task” is sufficient for AI to begin replacing humans at that task if that task is valuable. (I agree with you that next-token prediction is not itself commercially valuable, whereas something like proofreading would be)
I think I may have also misled you when I said “in the real world”, and I apologise for that. What I meant by “in the real world” was something like “in practice, the AI can be used reliably in a way that does better than humans”. Again, the calculator is a good metaphor here—we can reliably use calculators to do more accurate arithmetic than humans for actual problems that humans face every day. I understand how “in the real world” could be read as more like “embodied in a robotic form, interacting with the environment like humans do.” Clearly a calculator can’t do that and never will. I agree LMs cannot currently do that, and we have no indication that any ML system can currently do so at that level.
So, in summary, here is what I am saying:
If an AI can score higher than a human, using any sort of interface that still allows it to be reliably deployed on a range of data that humans want the task performed in, it can be used for that purpose. This can happen whether the AI is able to utilise human senses or not. If AI can be reliably used to produce outputs that are in some way better (faster, more accurate, etc.) than humans, it’s not important that the contest is fair—the AI will begin replacing humans at this task anyway.
That said, I think I understand your point better now. A system that could walk across the room in a physical body, turn on a computer, log on to redwoodresearch.com/next_word_prediction_url, view words on a screen, and click a mouse to select which word it predicts will appear next is FAR more threatening than a system that takes in words directly as input and returns a word as output. That would be an indication that AIs were on the brink of outperforming humans, not just at the task of predicting tokens, but at a very wide range of tasks. I agree this is not happening yet, and I agree that the distinction matters between this paragraph and my claim above.
I haven’t answered your claim about the subconscious abilities of humans to predict text better than this game would indicate because I’m really not sure about whether that’s true or not—not in a “I’ve seen the evidence and it could go either way” kind of way, but in a “I’ve never even thought about it” kind of way. So I’ve avoided engaging with that part of the argument—I don’t think it’s load-bearing for the parts I’ve been discussing in this post, but please let me know if I’m wrong.
Note that my notion of outperform here is exactly the same notion one would use when comparing different ML systems. They obviously need to be performing exactly the same task—in general you can’t claim ML system X is better than ML system Y when X is wired directly to the output and Y has to perform a more complex visual-motor task that includes X’s task as a subtask. That would be—ridiculous, frankly.
If AI can be reliably used to produce outputs that are in some way better (faster, more accurate, etc.) than humans, it’s not important that the contest is fair—the AI will begin replacing humans at this task anyway.
The issue is that nobody cares about this specific task directly. I’m pretty sure you don’t even care about this specific task directly. The only interest in this task is as some sort of proxy for performance on actually important downstream tasks (reading, writing, math etc). And that’s why it’s so misleading to draw conclusions from ML system performance on proxy task A and human performance on (more complex) proxy task B.
(I agree with you that next-token prediction is not itself commercially valuable, whereas something like proofreading would be)
I think you would agree that humans are enormously better at proofreading than is implied by this weird comparison (on two different proxy tasks). Obvious counter-example proof: there are humans who haven’t played this specific game yet and would score terribly, but are good at proofreading.