The usual argument against this being a big deal is “to predict the next token well, you must have an accurate model of the world”, but so far it does not seem to be the case, as I understand it.
Why does that not seem to be the case to you?
I’d guess he’s thinking of the observation that, when this has actually been tested, humans seem a lot worse at next-token prediction than even a GPT-3 model. This raises questions about the next-token logic: why doesn’t superhuman next-token prediction then produce superhuman intelligence?
(Background)
However, I don’t think that necessarily works. The original logic is correct: being an accurate next-token predictor is clearly sufficient in at least some next-token scenarios, like a dataset constructed to include only the most difficult multiple-choice problems (eg. GPQA), because then you can simply pose all tasks in the form of a multiple-choice question and, by definition, an accurate predictor will perform as well as the humans. But the models do not, and the benchmark scores are still sub-human, so we know almost by definition that if we asked humans for the probability of the next token where next-token=answer, they would predict better. Note that we didn’t say “random Internet text” but “only the most difficult problems”. The next-token argument doesn’t work for average text. Humans predict worse on average text, but predict better on some important subsets of text.
The models are clearly subhuman on many text benchmarks, even though that is still ‘just’ next-token prediction of the answer-completions. It is also the case that, AFAIK, we have no benchmarks comparing human predictions on much longer passages: the GPT-2 model may beat you if you have to predict the next token, but you can easily beat it if you are given several candidate versions of the next 100 tokens and asked to predict which one is more likely. How can it beat us on average at predicting a random next token, yet lose to us at predicting many next tokens? (“We lose money on each unit we sell, but don’t worry, we’ll make it up on volume!”)
What this is telling us is that the model appears to be ‘cheating’ by winning a lot of predictive edge over unimportant tokens, even though its errors accumulate and it fails to predict key tokens. The correct comparison can’t be ‘the average Internet next-token’. It has to be specific key ‘golden’ tokens, which are analogous to the choice ‘a’/‘b’/‘c’/‘d’ of answering a multiple choice question: you can predict every token up to that, but if you aren’t genuinely understanding, you can’t predict the final one of ‘a’ rather than ‘d’. (Or my old example of a murder mystery: thousands and thousands of tokens which must be analyzed deeply in order to predict the final handful of tokens which complete the text “And the murderer is - !”.) A model mimics the easy tokens flawlessly, but then once it hits a critical junction point, it goes off the rails, and then the human chugs along past it. In a benchmark dataset, those junction points come up regularly and are indeed the entire point, while in random Internet text, there might be zero such points, depending on how repetitive or superficial or mundane the text is.
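To make the ‘cheating’ point concrete, here is a toy calculation: a passage of 99 easy tokens followed by 1 key token, scored both by average log-loss and by loss on the key token alone. All the probabilities are made-up numbers for illustration, not measurements of any model or human.

```python
import math

def avg_log_loss(probs):
    """Mean negative log-likelihood (in nats) of the probabilities given to the true tokens."""
    return -sum(math.log(p) for p in probs) / len(probs)

# Hypothetical passage: 99 easy filler tokens, then 1 key token (e.g. the answer letter).
N_EASY = 99

# Assumed probability each predictor assigns to the *correct* token; illustrative only.
model_probs = [0.90] * N_EASY + [0.25]  # strong on filler, chance-level on a 4-way answer
human_probs = [0.60] * N_EASY + [0.95]  # mediocre on filler, reliable on the answer

model_avg, human_avg = avg_log_loss(model_probs), avg_log_loss(human_probs)
model_key, human_key = -math.log(model_probs[-1]), -math.log(human_probs[-1])

print(f"average loss    model={model_avg:.3f}  human={human_avg:.3f}")
print(f"key-token loss  model={model_key:.3f}  human={human_key:.3f}")
```

Under these assumed numbers the model has the lower average loss while the human has the lower loss on the key token, which is the pattern described above: an average over all tokens can hide a failure at the one token that matters.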
So why does training on low-quality average tokens demonstrably work even though the models are superhuman at that, and the token prediction argument is inapplicable to such tokens? Well, that’s a good question.
The easiest answer (drawing on active learning / experiment design / reinforcement learning / coreset / machine teaching observations about optimal sample-efficiency & how far away LLMs are in pretraining from what seems like human sample-efficiency) is that the models have such large capacities that they can learn all the superficial things that humans have not, which are useful for predicting the average next-token but do not themselves elicit the deep capabilities we want; it is then the occasional ‘gold’ token which very very gradually forces the model to learn those too. So a model is brrring through vast reams of Internet text, successfully memorizing every meme or stylistic tic or spammer text over millions of tokens, and once in a while, someone says something actually meaningful to predict like “I put my ice cream in the microwave and then it ______” and it makes a mistake in predicting “melted” and learns a bit about real-world physics and commonsense, and then goes back to the memorization. There is, I think, a good deal of evidence for this. (And this predicts, among other things, that it should be possible to train models of great intelligence with many OOMs less data than we do now.)
I wonder if giving lower rewards for correctly guessing common tokens, and higher rewards for correctly guessing uncommon tokens would improve models? I don’t think I’ve seen anyone trying this.
Found: https://ar5iv.labs.arxiv.org/html/1902.09191 - “Improving Neural Response Diversity with Frequency-Aware Cross-Entropy Loss”.
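A minimal sketch of what that kind of reweighting could look like, assuming simple inverse-frequency weights normalized so the average weight over the corpus is 1. This is not claimed to be the loss used in the linked paper; the function names and the `smoothing` constant are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def inverse_frequency_weights(corpus_tokens, smoothing=1.0):
    """Per-token weights proportional to 1 / (count + smoothing).

    Weights are rescaled so that the count-weighted mean over the corpus is 1,
    which keeps the overall loss scale roughly comparable to the unweighted loss.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    raw = {tok: 1.0 / (c + smoothing) for tok, c in counts.items()}
    mean_weight = sum(raw[tok] * c for tok, c in counts.items()) / total
    return {tok: w / mean_weight for tok, w in raw.items()}

def weighted_cross_entropy(target_tokens, target_probs, weights):
    """Average of -w(token) * log p(token) over the target positions."""
    losses = [-weights[t] * math.log(p) for t, p in zip(target_tokens, target_probs)]
    return sum(losses) / len(losses)

# Toy corpus (assumed): a common token and a rarer one.
corpus = ["the"] * 8 + ["melted"] * 2
weights = inverse_frequency_weights(corpus)

# The same prediction error costs more on the rarer token.
loss_common = weighted_cross_entropy(["the"], [0.5], weights)
loss_rare = weighted_cross_entropy(["melted"], [0.5], weights)
print(weights, loss_common, loss_rare)
```

Note that this weights by corpus frequency only, so it rewards rare tokens whether they are meaningful or junk.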
It’s not obvious that ‘uncommon’ tokens are good or that that’s a good approach.
They could also just be unlikely or garbage, and your screening method for filtering for ‘uncommon’ tokens may ensure that they are garbage, or otherwise sabotage your model. (This is the ‘mammogram screening problem’: even if you have a good filter, if you run it across trillions of tokens, you will wind up throwing out many good tokens and keeping many bad tokens. There are a number of LLM-related papers about the horrifically bad data you can wind up compiling if you neglect data cleaning, particularly in multilingual translation when you’re trying to scrape rare languages off the general Internet.)
Nor are good datapoints necessarily made up of uncommon tokens: there are zero uncommon tokens in my ‘microwave’ example.
(Data pruning & active learning are hard.)
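The base-rate arithmetic behind the ‘mammogram screening problem’ can be sketched as follows. The corpus size, the fraction of good tokens, and the filter's accuracy are all hypothetical numbers picked for illustration, and `filter_outcomes` is a made-up helper, not part of any library.

```python
def filter_outcomes(n_tokens, good_rate, sensitivity, specificity):
    """Expected counts when a binary 'keep the good tokens' filter is run over a corpus.

    sensitivity: fraction of good tokens the filter keeps.
    specificity: fraction of bad tokens the filter discards.
    """
    good = n_tokens * good_rate
    bad = n_tokens - good
    kept_good = good * sensitivity
    kept_bad = bad * (1.0 - specificity)
    return {
        "kept_good": kept_good,
        "dropped_good": good - kept_good,
        "kept_bad": kept_bad,
        "precision": kept_good / (kept_good + kept_bad),
    }

# Assumed scenario: 1 trillion tokens, 1% of them good, a filter that is 99% accurate both ways.
result = filter_outcomes(n_tokens=1e12, good_rate=0.01, sensitivity=0.99, specificity=0.99)
for name, value in result.items():
    print(f"{name}: {value:.3g}")
```

With these assumed numbers, about half of what the filter keeps is still bad, and on the order of a hundred million good tokens are thrown away, even though the filter is right 99% of the time on each individual token.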