Let me restate some of my points, which can hopefully make my position clearer. Maybe state which part you disagree with:
Language models are probability distributions over finite sequences of text.
The “true distribution” of internet text refers to a probability distribution over sequences of text that you would find on the internet (including sequences found on other internets elsewhere in the multiverse, which is just meant as an abstraction).
A language model is “better” than another language model to the extent that the cross-entropy between the true distribution and the model is lower (see the sketch at the end of this comment for how this would be estimated, for both the next-character and whole-sequence tasks).
A human who writes a sequence of text is likely to write something with a relatively high log probability under the true distribution. This is because, in a quite literal sense, the true distribution is just the distribution over what humans actually write.
A current SOTA model, by contrast, is likely to write something with an extremely low log probability, most likely because it will write something that lacks long-term coherence and is inhuman, and thus won’t be something that would ever appear in the true distribution (or, if it does appear, it appears only very rarely).
The last two points provide strong evidence that humans are actually better at the long-sequence task than SOTA models, even though they’re worse at the next-character task.
Intuitively, this is because the SOTA model loses a gigantic amount of log probability when it generates whole sequences that no human would ever write. This doesn’t happen on the next-character prediction task because you don’t need a very good understanding of long-term coherence to predict the vast majority of next characters, and this effect dominates the effect of lacking long-term coherence in the next-character task.
It is true (and I didn’t think of this before) that the human’s cross-entropy score will probably be really high purely because they won’t even think to put any probability on some types of sequences that appear in the true distribution. I still don’t think this makes them worse than SOTA language models because the SOTA will also have ~0 probability on nearly all actual sequences. However…
Even if you aren’t convinced by my last argument, I can simply modify what I mean by the “true distribution” to mean the “true distribution of texts that are in the reference class of things we care about”. There’s absolutely no reason to say the true distribution has to be “everything on the internet” as opposed to “all books” or even “articles written by Rohin” if that’s what we’re actually trying to model.
Thus, I don’t accept one of your premises. I expect current language models to be better than you at next-character prediction on the empirical distribution of Rohin articles, but worse than you at whole sequence prediction for Rohin articles, for reasons you seem to already accept.
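(To make the two scores I keep referring to concrete, here is a minimal sketch of how each would be estimated. The names `true_samples` and `log_prob` are hypothetical stand-ins, not an existing API: sequences drawn from the true distribution, and whatever interface we use to query the predictor, human or model, for next-character probabilities.)

```python
from typing import Callable, List

# Hypothetical interface: log_prob(prefix, next_char) -> log probability that
# the predictor (human or model) assigns to next_char following prefix.
LogProbFn = Callable[[str, str], float]

def per_token_cross_entropy(true_samples: List[str], log_prob: LogProbFn) -> float:
    """Average negative log prob per character: the next-character task."""
    total, count = 0.0, 0
    for seq in true_samples:
        for t, ch in enumerate(seq):
            total += -log_prob(seq[:t], ch)
            count += 1
    return total / count

def per_sequence_cross_entropy(true_samples: List[str], log_prob: LogProbFn) -> float:
    """Average negative log prob per whole sequence: the long-sequence task.

    By the chain rule this is the per-token score summed over each sequence,
    so any long sequence gets a very low probability; what matters is how two
    predictors compare, not the absolute number.
    """
    total = 0.0
    for seq in true_samples:
        total += sum(-log_prob(seq[:t], ch) for t, ch in enumerate(seq))
    return total / len(true_samples)
```

The claim in the points above is that humans come out ahead on the second score even while losing on the first.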
If I had to pick a claim to disagree with it would be the one about the “true distribution”, but it’s less that I disagree with the claim, and more that I disagree with using this particular method for declaring whether AIs are superhuman at language modeling.
> but worse than you at whole sequence prediction for Rohin articles, for reasons you seem to already accept.
But a model can never be better than me at this task! Why are we interested in it?
This really feels like the core issue: if you define a task as “sample accurately from distribution D”, and you define the distribution D as “whatever is produced by X”, then X is definitionally optimal at the task, and no AI system is ever going to be super-X.
(Even once the AI system can fool humans into thinking its text is human-generated, there could still be something inhuman about it that the humans fail to pick up on, which means that its generations are near-zero probability.)
In the language modeling case it isn’t quite this perverse (because the distribution is “whatever is produced by all humans” whereas we’re only asking the AI to be better than one human), but it still seems pretty perverse.
I think a better evaluation of human ability at language modeling would be (1) sampling a sequence from the empirical dataset, (2) sampling a sequence from a human, (3) evaluating how often these two are the same, and (4) doing some math to turn this into a perplexity. This would have huge variance (obviously most of the time the sequences won’t be the same in step (3)), but you can reduce the variance by sampling continuations for a given prompt rather than sampling sequences unprompted, and reduce it further by looking just at sampling the next token given a prompt (at which point you are at the scheme used in this post).
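Here is a rough sketch of the lowest-variance version of this (next token given a prompt), just to pin down the procedure. `sample_true_next_token` and `sample_human_next_token` are hypothetical stand-ins for drawing the actual continuation from the dataset and eliciting one from a human, and the step (4) conversion shown is only one simple option:

```python
def estimate_human_perplexity(prompts, sample_true_next_token, sample_human_next_token,
                              trials_per_prompt=100):
    """Hypothetical sketch: estimate a perplexity-like score from match rates."""
    matches, total = 0, 0
    for prompt in prompts:
        for _ in range(trials_per_prompt):
            # Steps (1)/(2): one sample from the empirical data, one from the human.
            true_token = sample_true_next_token(prompt)
            human_token = sample_human_next_token(prompt)
            # Step (3): count how often they agree.
            matches += int(true_token == human_token)
            total += 1
    match_rate = matches / total
    # Step (4), "some math": one simple option is 1 / match_rate, which by
    # Jensen's inequality is a lower bound on the true perplexity rather than
    # an unbiased estimate of it.
    return float("inf") if match_rate == 0 else 1.0 / match_rate
```

The unprompted whole-sequence version is the same loop with entire sequences in place of single tokens; the match rate just becomes astronomically small, which is the variance problem mentioned above.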
> I still don’t think this makes them worse than SOTA language models because the SOTA will also have ~0 probability on nearly all actual sequences.
On my view, this is just the sum of log probs on next tokens, so presumably the result from this post implies that the language model will be way better than humans (while still assigning very low probability to a given sequence, just because the sequence is long, similarly to how you would assign very low probability to any given long sequence of coin flips).
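As a toy illustration of the coin-flip analogy (made-up numbers, not anything measured in the post):

```python
import math

# Log probability that two predictors assign to one particular sequence of
# 1000 fair coin flips (assume the sequence happens to be half heads, half tails).
seq_len = 1000
optimal = seq_len * math.log(0.5)                          # fair-coin predictor
biased = (seq_len / 2) * (math.log(0.6) + math.log(0.4))   # 60/40 predictor

print(optimal)  # ~ -693.1
print(biased)   # ~ -713.6
# Both probabilities are astronomically small, yet the fair-coin predictor is
# optimal; only the comparison of log probs tells you which predictor is better.
```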
However, I’m still not sure how exactly you are thinking of “the probability assigned by the human”. It seems like you don’t like the methodology of this post, but if not that, I don’t see how else you would elicit a probability distribution over tokens / words / sequences from humans, so I’m not really sure how to ground this claim.
(Whereas I think I do understand your claims about what kinds of text humans tend to generate.)