I think the reason we appear to disagree here is that we’re both using different measurements of “outperform”.
My understanding is that Jacob!outperform means to win in a contest where all other variables are the same—thus, you can’t say that LMs outperform humans when they don’t need to apply the visual and motor skills that humans do. The interfaces aren’t the same, so the contest is not fair. If I score higher than you in a tenpin bowling match where I have the safety rails up and you don’t, we can’t say I’ve outperformed you in tenpin bowling.
Jay!outperform means to do better on a metric (such as “How often can you select the next word?”) where each side uses an interface suited to it, provided performance on that metric correlates with the ability to perform the task on a wide range of inputs. That is to say—it’s fine for the computer to cheat, so long as that cheating doesn’t prevent it from completing the task out of distribution, or the distribution is wide enough to cover anything a user is likely to want from the program in a commercial setting. If the AI were only trained on a small corpus and learned to simply memorise the entire corpus, that wouldn’t count as outperforming, because the AI would fall apart if we tried to use it on any other text. But since the task we want to check is text prediction, not visual input or robotics, it doesn’t matter that the AI doesn’t have to see the words.
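For concreteness, here is a minimal sketch of the kind of shared metric I have in mind: top-1 next-word accuracy, computed the same way for both sides even though each uses its own interface. The predictor functions (`lm_predict`, `human_clicks`) and the held-out passages are hypothetical placeholders, not anything from the actual game.

```python
# Minimal sketch: score any predictor on top-1 next-word accuracy.
# An LM fed the text directly and a human reading words on a screen and
# clicking a mouse would both be scored by this same function, regardless
# of how they "see" the context.

def top1_accuracy(predict_next_word, passages):
    """predict_next_word(context_words) -> predicted next word (a string)."""
    correct = 0
    total = 0
    for words in passages:                      # each passage is a list of words
        for i in range(1, len(words)):
            context, target = words[:i], words[i]
            if predict_next_word(context) == target:
                correct += 1
            total += 1
    return correct / total if total else 0.0

# Usage (hypothetical predictors and data):
# lm_score    = top1_accuracy(lm_predict, held_out_passages)
# human_score = top1_accuracy(human_clicks, held_out_passages)
# "Jay!outperform" just means lm_score > human_score on text neither side has memorised.
```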
Both these definitions have their place. Saying “AI can Jacob!outperform humans at this task” would mean the AI was closer to AGI than if it could only Jay!outperform humans at that task. I can also see how Jacob!outperforming would be possible for a truly general intelligence. However, saying “AI can Jay!outperform humans at this task” is sufficient for AI to begin replacing humans at that task, if that task is valuable. (I agree with you that next-token prediction is not itself commercially valuable, whereas something like proofreading would be.)
I think I may have also misled you when I said “in the real world”, and I apologise for that. What I meant by “in the real world” was something like “in practice, the AI can be used reliably in a way that does better than humans”. Again, the calculator is a good metaphor here—we can reliably use calculators to do more accurate arithmetic than humans for actual problems that humans face every day. I understand how “in the real world” could be read as more like “embodied in a robotic form, interacting with the environment like humans do.” Clearly a calculator can’t do that and never will. I agree LMs cannot currently do that, and we have no indication that any ML system can currently operate at that level.
So, in summary, here is what I am saying:
If an AI can score higher than a human, using any sort of interface that still allows it to be reliably deployed on the range of data that humans want the task performed on, it can be used for that purpose. This can happen whether or not the AI is able to utilise human senses. If AI can be reliably used to produce outputs that are in some way better (faster, more accurate, etc.) than humans, it’s not important that the contest is fair—the AI will begin replacing humans at this task anyway.
That said, I think I understand your point better now. A system that could walk across the room in a physical body, turn on a computer, log on to redwoodresearch.com/next_word_prediction_url, view words on a screen, and click a mouse to select which word it predicts will appear next is FAR more threatening than a system that takes in words directly as input and returns a word as output. That would be an indication that AIs were on the brink of outperforming humans, not just at the task of predicting tokens, but at a very wide range of tasks. I agree this is not happening yet, and I agree that the distinction between this paragraph and my claim above matters.
I haven’t answered your claim about the subconscious abilities of humans to predict text better than this game would indicate, because I’m really not sure whether that’s true or not—not in an “I’ve seen the evidence and it could go either way” kind of way, but in an “I’ve never even thought about it” kind of way. So I’ve avoided engaging with that part of the argument—I don’t think it’s load-bearing for the parts I’ve been discussing in this post, but please let me know if I’m wrong.
Note that my notion of outperform here is exactly the same notion that one would use when comparing different ML systems. They obviously need to be performing exactly the same task—in general you can’t claim ML system X is better than ML system Y when X is wired directly to the output and Y has to perform a more complex visual-motor task that includes X’s task as a subtask. That would be—ridiculous, frankly.
If AI can be reliably used to produce outputs that are in some way better (faster, more accurate, etc.) than humans, it’s not important that the contest is fair—the AI will begin replacing humans at this task anyway.
The issue is that nobody cares about this specific task directly. I’m pretty sure you don’t even care about this specific task directly. The only interest in this task is as some sort of proxy for performance on actually important downstream tasks (reading, writing, math, etc.). And that’s why it’s so misleading to draw conclusions from ML system performance on proxy task A and human performance on (more complex) proxy task B.
(I agree with you that next-token prediction is not itself commercially valuable, whereas something like proofreading would be)
I think you would agree that humans are enormously better at proofreading than implied by this weird comparison (on two different proxy tasks). Obvious counter-example proof: there are humans who haven’t played this specific game yet and would score terribly, but who are good at proofreading.