IIRC Redwood Research investigated human performance on next-token prediction, and humans were mostly worse than even small (by current standards) language models?
sounds right, where “worse” here means “higher bits per word when predicting an existing sentence”, a very unnatural metric that humans don’t spend significant effort on.
That is actually a natural metric for the brain, and close to what the linguistic cortex does internally. The comparison is still off, though: having a human play a word-prediction game and comparing their scores against the LLM’s native internal logit predictions is kind of silly. The real comparison is either a human playing that game versus an LLM playing the exact same game in the exact same way (i.e. asking GPT verbally to predict the logit score of the next word/token), or comparing internal low-level transformer logit scores against linear readout models from brain neural probes/scans.
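For concreteness, here is a minimal sketch of what “bits per word” means on the LLM side, assuming the Hugging Face transformers library and GPT-2 as a stand-in small model (both are my assumptions, not the exact Redwood setup). The human version of the game would be scored with the same formula, just using the probabilities the human assigns.

```python
# Minimal sketch: compute an LLM's bits per word on a sentence.
# Assumes Hugging Face transformers with GPT-2; model choice and the
# crude word count are illustrative, not the actual Redwood methodology.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "The quick brown fox jumps over the lazy dog."
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    # Mean cross-entropy over next-token predictions, in nats per token.
    loss = model(ids, labels=ids).loss

n_predicted_tokens = ids.shape[1] - 1   # tokens the model actually predicts
n_words = len(sentence.split())          # crude word count

total_nats = loss.item() * n_predicted_tokens
bits_per_word = total_nats / math.log(2) / n_words
print(f"GPT-2: {bits_per_word:.2f} bits per word on this sentence")
```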
oh interesting point, yeah.