Thanks for the response! We’ve rewritten the paragraph starting “The limitations detailed...” for clarity.
Some brief responses to your points:
Shannon’s estimate was about a different quantity. [...]
We agree that Shannon was interested in something else—the “true” entropy of English, using an ideal predictor for English. However, as his estimates of entropy used his wife Mary Shannon and Barnard Oliver as substitutes for his ideal predictor, we think it’s still fair to treat this as an estimate of the entropy/perplexity of humans on English text.
As you point out, there’s definitely been a bunch of follow-up work which finds various estimates of the entropy/perplexity of human predictors. The Cover and King source you found above does give a higher estimate consistent with our results. Note their estimator shares many of the same pitfalls as our estimator—for example, if subjects aren’t calibrated, they’ll do quite poorly with respect to both the Cover and King estimator and our estimator. We don’t really make any claims that our results are surprising relative to all other results in this area, merely noting that our estimate is inconsistent with perhaps the most widely known one.
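To illustrate the calibration point with a toy example (made-up numbers; this is not the Cover and King estimator itself, just a sketch of why miscalibration hurts under log loss):

```python
import math

# Toy illustration with hypothetical numbers: a binary next-token guess where
# the subject's favored token is actually correct 80% of the time.
P_CORRECT = 0.8

def expected_log_loss(reported_prob):
    """Expected log loss (nats) if the subject reports `reported_prob` for their
    favored token, which is truly correct with probability P_CORRECT."""
    return -(P_CORRECT * math.log(reported_prob)
             + (1 - P_CORRECT) * math.log(1 - reported_prob))

print(expected_log_loss(0.80))  # calibrated subject: ~0.50 nats per guess
print(expected_log_loss(0.99))  # overconfident subject: ~0.93 nats, nearly double
```

Overconfidence roughly doubles the expected loss here, and that kind of penalty shows up under both estimators.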
Separately, in my opinion, a far better measure of human-level performance at language modeling is the perplexity level at which a human judge can no longer reliably distinguish between a long sequence of generated text and a real sequence of natural language.
The measure you suggest is similar to the methodology used by Shen et al. 2017 to get a human-level perplexity estimate of 12, which we did mention and criticize in our writeup.
We disagree that this measure is better. Our goal here isn’t to compare the quality of Language Models to the quality of human-generated text; we aimed to compare LMs and humans on the metric that LMs were trained on (minimize log loss/perplexity when predicting the next token). As our work shows, Language Models whose output has significantly worse quality than human text (such as GPT-2 small) can still significantly outperform humans on next token prediction. We think it’s interesting that this happens, and speculated a bit more on the takeaways in the conclusion.
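For concreteness, here is a minimal sketch of the metric we mean, with hypothetical probabilities standing in for whatever a predictor assigned to the token that actually came next:

```python
import math

# Hypothetical probabilities a predictor (human or LM) assigned to the token
# that actually came next at five positions in some text.
probs_on_true_token = [0.31, 0.04, 0.55, 0.12, 0.09]

# Average log loss (cross-entropy, in nats) over those positions, and its
# exponential, the perplexity. Lower is better for both; this per-token loss
# is the quantity LMs are trained to minimize.
log_loss = -sum(math.log(p) for p in probs_on_true_token) / len(probs_on_true_token)
perplexity = math.exp(log_loss)
print(f"log loss = {log_loss:.2f} nats, perplexity = {perplexity:.1f}")
```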
We disagree that this measure is better. Our goal here isn’t to compare the quality of Language Models to the quality of human-generated text; we aimed to compare LMs and humans on the metric that LMs were trained on (minimize log loss/perplexity when predicting the next token).
Your measure is great for your stated goal. That said, I feel the measure gives a misleading impression to readers. In particular, I’ll point to this paragraph in the conclusion:
Even current large language models are wildly superhuman at language modeling. This is important to remember when you’re doing language model interpretability, because it means that you should expect your model to have a lot of knowledge about text that you don’t have. Chris Olah draws a picture where he talks about the possibility that models become more interpretable as they get to human level, and then become less interpretable again as they become superhuman; the fact that existing LMs are already superhuman (at the task they’re trained on) is worth bearing in mind when considering this graph.
I think it’s misleading to say that language models are “wildly superhuman at language modeling” by any common-sense interpretation of that claim. While the claim is technically true if one simply means that language models do better at the predict-the-next-token task, most people (I’d imagine) would not intuitively consider that to be the best measure of general performance at language modeling. The reason, fundamentally, is that we are building language models to compete with humans at the task of writing text, not the task of predicting the next character.
By analogy, if we train a robot to play tennis by training it to emulate human tennis players, I think most people would think that “human level performance” is reached when it can play as well as a human, not when it can predict the next muscle movement of an expert player better than humans, even if predicting the next muscle movement was the task used during training.
Ah, I see your point. That being said, I think calling the task we train our LMs to do (learn a probabilistic model of language) “language modeling” seems quite reasonable to me—in my opinion, it seems far more unreasonable to call “generating high quality output” “language modeling”. For one thing, there are many LM applications that aren’t just “generate high quality text”! There’s a whole class of LMs like BERT that can’t really be used for text generation at all.
One reason we at Redwood care about this result is that we want to interpret modern LMs. As outlined in the linked Chris Olah argument, we might intuitively expect that AIs get more interpretable as they get to human level performance, then less interpretable as their performance becomes more and more superhuman. If LMs were ~human level at the task they are trained on, we might hope that they contain mainly crisp abstractions that humans find useful for next token prediction. However, since even small LMs are superhuman at next token prediction, they probably contain alien abstractions that humans can’t easily understand, which might pose a serious problem for interpretability efforts.
Ah, I see your point. That being said, I think calling the task we train our LMs to do (learn a probabilistic model of language) “language modeling” seems quite reasonable to me—in my opinion, it seems far more unreasonable to call “generating high quality output” “language modeling”.
Note that the main difference between my suggested task and the next-character-prediction task is that I’m suggesting we measure performance over a long time horizon. “Language models” are, formally, probability distributions over sequences of text, not models over next characters within sequences. It is only via a convenient application of the Markov assumption and the chain rule of probability that we use next-character-prediction during training.
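For reference, the decomposition in question is just the chain rule (nothing here is specific to any particular model):

$$\log p(x_1, \dots, x_n) = \sum_{i=1}^{n} \log p(x_i \mid x_1, \dots, x_{i-1}),$$

with the Markov assumption entering only when the conditioning context is truncated to a fixed window.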
The actual task, in the sense of what language models are fundamentally designed to perform well on, is to emulate sequences of human text. Thus, it is quite natural to ask when they can perform well on this task. In fact, I remain convinced that it is more natural to ask about performance on the long-sequence task than the next-character-prediction task.
In fact, I remain convinced that it is more natural to ask about performance on the long-sequence task than the next-character-prediction task.
Large language models are also going to be wildly superhuman by long-sequence metrics like “log probability assigned to sequences of Internet text” (in particular because many such metrics are just sums over the next-character versions of the metric, which this post shows LLMs are great at).
LLMs may not be superhuman at other long-sequence tasks like “writing novels” but those seem like very different tasks, just like “playing already-composed music” and “composing new music” are very different tasks for humans.
(To be clear, I think it’s fair to say that what we care about with LLMs is the latter category of stuff, but let’s not call that “language modeling”.)
Large language models are also going to be wildly superhuman by long-sequence metrics like “log probability assigned to sequences of Internet text”
I think this entirely depends on what you mean. There’s a version of the claim here that I think is true, but I think the most important version of it is actually false, and I’ll explain why.
I claim that if you ask a human expert to write an article (even a relatively short one) about a non-trivial topic, their output will have a higher log probability than a SOTA language model, with respect to the “true” distribution of internet articles. That is, if you were given the (entirely hypothetical) true distribution of actual internet articles (including articles that have yet to be written, and the ones that have been written in other parts of the multiverse...), a human expert is probably going to write an article that has a higher log probability of being sampled from this distribution, compared to a SOTA language model.
This claim might sound bizarre at first, because, as you noted “many such metrics are just sums over the next-character versions of the metric, which this post shows LLMs are great at”. But, first maybe think about this claim from first principles: what is the “true” distribution of internet articles? Well, it’s the distribution of actual internet articles that humans write. If a human writes an article, it’s got to have pretty high log-probability, no? Because otherwise, what are we even sampling from?
Now, what you could mean is that instead of measuring the log probability of an article with respect to the true distribution of internet articles, we measure it with respect to the empirical distribution of internet articles. This is in fact what we use to measure the log-probability of next character predictions. But the log probability of this quantity over long sequences will actually be exactly negative infinity, both for the human-written article, and for the model-written article, assuming they’re not just plagiarizing an already-existing article. That is, we aren’t going to find any article in the empirical distribution that matches the articles either the human or the model wrote, so we can’t tell which of the two is better from this information alone.
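To make that concrete: if $\hat{p}$ is the empirical distribution over the $N$ articles $s_1, \dots, s_N$ in the dataset, then

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[x = s_i], \qquad \log \hat{p}(x) = -\infty \;\text{ for any } x \notin \{s_1, \dots, s_N\},$$

so both the human’s article and the model’s article score $-\infty$ unless they exactly reproduce something already in the dataset.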
What you probably mean is that we could build a model of the true distribution of internet articles, and use this model to estimate the log-probability of internet articles. In that case, I agree, a SOTA language model would probably far outperform the human expert, at the task of writing internet articles, as measured by the log-probability given by another model. But, this is a flawed approach, because the model we’re using to estimate the log-probability with respect to the true distribution of internet articles is likely to be biased in favor of the SOTA model, precisely because it doesn’t understand things like long-sequence coherence, unlike the human.
How could we modify this approach to give a better estimate of the performance of a language model at long-sequence prediction? I think that there’s a relatively simple approach that could work.
Namely, we set up a game in which humans try to distinguish between real human texts and generated articles. If the humans can’t reliably distinguish between the two, then the language model being used to generate the articles has attained human-level performance (at least by this measure). This task has nice properties, as there is a simple mathematical connection between prediction ability and ability to discriminate; a good language model that can pass this test will likely only pass it because it is good at coming up with high log-probability articles. And this task also measures the thing we care about that’s missing from the predict-the-next-character task: coherence over long sequences.
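To sketch the connection I have in mind (a standard identity, not a claim about any particular experimental protocol): if real and generated articles are shown with equal frequency, the best possible judge’s belief that an article $x$ is human-written is

$$D^*(x) = \frac{p_{\text{human}}(x)}{p_{\text{human}}(x) + p_{\text{model}}(x)},$$

and the resulting accuracy is $\frac{1}{2} + \frac{1}{2}\,\mathrm{TV}(p_{\text{human}}, p_{\text{model}})$. So reliable indistinguishability over whole articles requires $p_{\text{model}}$ to put roughly human-like probability on whole articles, which is exactly the long-sequence quantity the next-character task misses.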
But, first maybe think about this claim from first principles: what is the “true” distribution of internet articles? Well, it’s the distribution of actual internet articles that humans write. If a human writes an article, it’s got to have pretty high log-probability, no? Because otherwise, what are we even sampling from?
I continue to think you are making a point about generation / sampling, whereas language modeling is about modeling the distribution as a whole.
Put another way, even if I am better than the LLM at the specific part of the Internet text distribution that is “articles written by Rohin”, that does not mean that I am better than the LLM at modeling the entire distribution of Internet text (long horizon or not).
Even under your proposed game, I don’t think I can get to indistinguishability, if the discriminator is allowed to learn over time (which seems like the version that properly tests language modeling). They’ll notice peculiarities of my writing style and aspects of Internet writing style that I’m not able to mimic very well. If your proposed test was “when can LLMs beat humans at this game” that seems more reasonable (though it still advantages the humans because the judge is a human; similarly I expect that if the judge was an LLM that would advantage the LLMs relative to the humans).
What you probably mean is that we could build a model of the true distribution of internet articles, and use this model to estimate the log-probability of internet articles.
I don’t think this point depends on thinking about having a model of the true distribution of articles that we use to evaluate humans vs LLMs; I’m happy to talk about the true distribution directly (to the extent that a “true distribution” exists). (This gets tricky though because you could say that in the true distribution coherence errors ~never happen and so LLM outputs have zero probability, but this is basically dooming the LLM to never be superhuman by defining the true distribution to be “what humans output”, which seems like a symptom of a bad definition.)
Let me restate some of my points, which can hopefully make my position clearer. Maybe state which part you disagree with:
Language models are probability distributions over finite sequences of text.
The “true distribution” of internet text refers to a probability distribution over sequences of text that you would find on the internet (including sequences found on other internets elsewhere in the multiverse, which is just meant as an abstraction).
A language model is “better” than another language model to the extent that the cross-entropy between the true distribution and the model is lower (made precise in the formula below these points).
A human who writes a sequence of text is likely to write something with a relatively high log probability relative to the true distribution. This is because in a quite literal sense, the true distribution is just the distribution over what humans actually write.
A current SOTA model, by contrast, is likely to write something with an extremely low log probability, most likely because it will write something that lacks long-term coherence, and is inhuman, and thus, won’t be something that would ever appear in the true distribution (or if it appears, it appears very very very rarely).
The last two points provide strong evidence that humans are actually better at the long-sequence task than SOTA models, even though they’re worse at the next character task.
Intuitively, this is because the SOTA model loses a gigantic amount of log probability when it generates whole sequences that no human would ever write. This doesn’t happen on the next character prediction task because you don’t need a very good understanding of long-term coherence to predict the vast majority of next-characters, and this effect dominates the effect from a lack of long-term coherence in the next-character task.
It is true (and I didn’t think of this before) that the human’s cross entropy score will probably be really high purely because they won’t even think to have any probability on some types of sequences that appear in the true distribution. I still don’t think this makes them worse than SOTA language models because the SOTA will also have ~0 probability on nearly all actual sequences. However…
Even if you aren’t convinced by my last argument, I can simply modify what I mean by the “true distribution” to mean the “true distribution of texts that are in the reference class of things we care about”. There’s absolutely no reason to say the true distribution has to be “everything on the internet” as opposed to “all books” or even “articles written by Rohin” if that’s what we’re actually trying to model.
Thus, I don’t accept one of your premises. I expect current language models to be better than you at next-character prediction on the empirical distribution of Rohin articles, but worse than you at whole sequence prediction for Rohin articles, for reasons you seem to already accept.
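To pin down the cross-entropy criterion above (the standard sequence-level definition):

$$H(p, q) = -\,\mathbb{E}_{x \sim p}\big[\log q(x)\big],$$

where $p$ is the true distribution over whole sequences and $q$ is the human’s or the model’s distribution; by “better language model” I just mean lower $H(p, q)$.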
If I had to pick a claim to disagree with it would be the one about the “true distribution”, but it’s less that I disagree with the claim, and more that I disagree with using this particular method for declaring whether AIs are superhuman at language modeling.
but worse than you at whole sequence prediction for Rohin articles, for reasons you seem to already accept.
But a model can never be better than me at this task! Why are we interested in it?
This really feels like the core issue: if you define a task as “sample accurately from distribution D”, and you define the distribution D as “whatever is produced by X”, then X is definitionally optimal at the task, and no AI system is ever going to be super-X.
(Even once the AI system can fool humans into thinking its text is human-generated, there could still be something inhuman about it that the humans fail to pick up on, which means that its generations are near-zero probability.)
In the language modeling case it isn’t quite this perverse (because the distribution is “whatever is produced by all humans” whereas we’re only asking the AI to be better than one human), but it still seems pretty perverse.
I think a better evaluation of human ability at language modeling would be to (1) sample a sequence from the empirical dataset, (2) sample a sequence from a human, (3) evaluate how often these two are the same, and (4) do some math to turn this into a perplexity. This would have huge variance (obviously the sequences will rarely be the same in step (3)), but you can reduce the variance by sampling continuations for a given prompt rather than sampling sequences unprompted, and reduce it further by looking just at sampling the next token given a prompt (at which point you are at the scheme used in this post).
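Here is a minimal sketch of the reduced-variance (next-token) version of that procedure; the interface and the final “do some math” step are both stand-ins, not a worked-out proposal:

```python
import math

def estimate_human_perplexity(dataset, human_next_token, prompt_len=50):
    """Crude match-rate estimate of a human's next-token perplexity.

    dataset: list of token sequences sampled from the empirical distribution.
    human_next_token: callable taking a prompt (list of tokens) and returning
        the human's sampled guess for the next token (hypothetical interface).
    """
    matches, trials = 0, 0
    for seq in dataset:
        if len(seq) <= prompt_len:
            continue
        prompt, true_next = seq[:prompt_len], seq[prompt_len]
        matches += int(human_next_token(prompt) == true_next)
        trials += 1
    match_rate = matches / trials
    # Crudest conversion: treat the match rate as the average probability the
    # human's sampling distribution puts on the true next token, then invert it.
    # The exact "do some math" step is left open above; this is one simple
    # choice, not the only one.
    return 1.0 / match_rate if match_rate > 0 else math.inf
```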
I still don’t think this makes them worse than SOTA language models because the SOTA will also have ~0 probability on nearly all actual sequences.
On my view, this is just the sum of log probs on next tokens, so presumably the result from this post implies that the language model will be way better than humans (while still assigning very low probability to a given sequence, just because the sequence is long, similarly to how you would assign very low probability to any given long sequence of coin flips).
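As a quick illustration of the scale involved (numbers made up): over a 1,000-token sequence,

$$20^{-1000} \approx 10^{-1301} \qquad \text{vs.} \qquad 40^{-1000} \approx 10^{-1602},$$

so a predictor with per-token perplexity 20 assigns the whole sequence roughly $10^{301}$ times more probability than one with per-token perplexity 40, even though both numbers are astronomically small. That’s why “assigns very low probability to any given sequence” and “is way better at the sequence-level metric” are perfectly compatible.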
However, I’m still not sure how exactly you are thinking of “the probability assigned by the human”—it seems like you don’t like the methodology of this post, but if not that, I don’t see how else you would elicit a probability distribution over tokens / words / sequences from humans, and so I’m not really sure how to ground this claim.
(Whereas I think I do understand your claims about what kinds of text humans tend to generate.)