4.) Our assessment of LLM abilities is wrong: existing LLMs are actually vastly superhuman, and GPT-2 style models are already at human parity. This seems highly unlikely from actually interacting with these models. On the other hand, even GPT-2 models possess a lot of arcane knowledge that is superhuman, and it may be that the very powerful cognition of these small models is smeared across such a wide range of weird internet data that it appears much weaker than us in any specific facet. Intuitively, the idea is that a human and GPT-2 possess the same ‘cognitive/linguistic power’, but since GPT-2’s cognition is spread over a much wider data range than a human’s, its ‘linguistic power density’ is lower, and it therefore appears much less intelligent in the much smaller human-relevant domain in which we test it. I am highly unclear whether these concepts are actually correct or a useful frame through which to view things.
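(To make the ‘power density’ intuition concrete, here is a toy sketch. The numbers, the function name, and the very idea of treating ‘cognitive power’ as a single scalar spread evenly over a count of domains are my own illustrative assumptions, not anything established in the thread:)

```python
# Toy illustration of the "linguistic power density" framing above.
# Assumption (purely illustrative): an agent has a fixed budget of
# "cognitive power" spread evenly across the domains its training
# data covers; per-domain skill = total power / number of domains.

def power_density(total_power: float, num_domains: int) -> float:
    """Per-domain 'power' if total power is spread evenly over domains."""
    return total_power / num_domains

# Hypothetical numbers: same total power, very different spread.
human = power_density(total_power=100.0, num_domains=10)    # mostly human-relevant domains
gpt2 = power_density(total_power=100.0, num_domains=1000)   # smeared across weird internet data

print(f"human per-domain skill: {human:.1f}")  # 10.0
print(f"GPT-2 per-domain skill: {gpt2:.1f}")   # 0.1
```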
I think LLMs are great and plausibly superhuman at language; it’s just that we don’t want them to do language, we want them to do useful real-world tasks, and hijacking a language model to do useful real-world tasks is hilariously inefficient.
If you consider pure language tasks like “Here’s some information in format X, please reshuffle it into the equivalent in format Y”, then GPT-4 seems vastly superhuman. (I’m somewhat abusing terms here, since the “language” task of reshuffling information is somewhat different from the “language” task of autoregressively predicting it, but I think the two are probably much more closely related to each other than either is to applying the model to something genuinely useful? Idk.) I can’t remember how good GPT-2 was at this; I’m not sure I even bothered to try it.
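(For illustration, a minimal sketch of this kind of reshuffling task, assuming the OpenAI Python SDK; the model name, the CSV snippet, and the prompt wording are just examples, not anything from the thread:)

```python
# Sketch of a "reshuffle format X into format Y" task, assuming the
# OpenAI Python SDK (`pip install openai`). Model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

csv_snippet = "name,role\nAda,engineer\nGrace,admiral"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Here is some information in CSV format:\n"
            f"{csv_snippet}\n\n"
            "Please reshuffle it into the equivalent JSON list of objects."
        ),
    }],
)
print(response.choices[0].message.content)
```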
IIRC Redwood Research investigated human performance on next-token prediction, and humans were mostly worse than even small (by current standards) language models?
sounds right, where “worse” here means “a higher bits-per-word loss when predicting an existing sentence”, a very unnatural metric that humans don’t spend significant effort on.
That is actually a natural metric for the brain, and close to what the linguistic cortex does internally. Having a human play a word-prediction game and comparing their scores to the native internal logit predictions of an LLM is kinda silly. The real comparison should be between a human playing that game and an LLM playing the exact same game in the exact same way (i.e. asking GPT verbally to predict the next word/token and state a probability for it), or you should compare internal low-level transformer logits to linear readout models from brain neural probes/scans.
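(For concreteness, a minimal sketch of how the model-side number in this comparison is usually computed: the internal-logit bits-per-token of GPT-2 on a sentence, using Hugging Face transformers. This is illustrative only, not Redwood’s actual setup, and the “fair” version proposed above would instead prompt the model to state its prediction in words:)

```python
# Bits-per-token of GPT-2's *internal* next-token predictions on a sentence,
# using Hugging Face transformers. Illustrative only; not Redwood's harness.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy
    # (in nats per predicted token) of its own next-token distribution.
    out = model(**enc, labels=enc["input_ids"])

bits_per_token = out.loss.item() / math.log(2)
print(f"GPT-2 internal prediction loss: {bits_per_token:.2f} bits/token")
```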
I think LLMs are great and plausibly superhuman at language
I think the problem might be that “language” encompasses a much broader variety of tasks than image generation. For example, generating poetry with a particular rhyming structure or meter seems to be a pretty “pure” language task, yet even GPT-4 struggles with it. Meanwhile, diffusion models with a quarter of the parameter count of GPT-4 can output art in a dizzying variety of styles, from Raphael-like Renaissance realism to Picasso-like cubism.
oh interesting point, yeah.