(1)
Loss values are useful for comparing different models, but I don’t recommend trying to interpret what they “mean” in an absolute sense. There are various reasons for this.
One is that the “conversion rate” between loss differences and ability differences (as judged by humans) changes as the model gets better and the abilities become less trivial.
Early in training, when the model’s progress looks like realizing “huh, the word ‘the’ is more common than some other words”, these simple insights correspond to relatively large decreases in loss. Once the model basically kinda knows English or whatever the language is, it’s already made most of the loss progress it’s going to make, and the further insights we really care about involve much smaller changes in loss. See here for more on this by gwern.
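(As a toy illustration of that point, here is a little sketch, with a made-up corpus, of how "learning that 'the' is common" shows up as a loss drop relative to a know-nothing uniform baseline. The corpus and numbers are purely illustrative.)

```python
# Toy sketch (illustrative only): how "learning word frequencies" shows up
# as a loss drop relative to a know-nothing uniform baseline.
import math
from collections import Counter

corpus = ("the cat sat on the mat and the dog sat on the rug "
          "the cat saw the dog and the dog saw the cat").split()
n = len(corpus)
counts = Counter(corpus)

# Model A: uniform over the vocabulary (knows nothing about frequencies).
uniform_loss = math.log(len(counts))

# Model B: unigram model that has only learned word frequencies.
unigram_loss = -sum(c * math.log(c / n) for c in counts.values()) / n

print(f"uniform loss: {uniform_loss:.3f} nats/word")
print(f"unigram loss: {unigram_loss:.3f} nats/word")

# On real text with tens of thousands of word types, the uniform baseline is
# around log(50_000) ~ 10.8 nats/token, and simple statistics like these
# already account for much of the loss progress a model ever makes; the later,
# more interesting abilities show up as much smaller loss differences.
```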
(2)
No one really knows, but my money is on “humans are actually better at this through some currently-unknown mechanism,” as opposed to “humans are actually bad at this exact thing.”
Why do I think this?
Well, the reason we’re here talking about this at all is that LMs do write text of spookily high quality, even if they aren’t as good as humans at it. That wasn’t always true. Before the transformer architecture was invented in 2017, LMs were nowhere near this good, and few people knew or talked about them except researchers.
What changed with the transformer? To some extent, the transformer is really a “smarter” or “better” architecture than the older RNNs. If you do a head-to-head comparison with the same training data, the RNNs do worse.
But also, it’s feasible to scale transformers much bigger than we could scale the RNNs. You don’t see RNNs as big as GPT-2 or GPT-3 simply because it would take too much compute to train them.
So, even though all these models take tons of data to train, we could make the transformers really big and still train them on the tons-of-data they require. And then, because scaling up N really does help, you get a model good enough that you and I are here talking about it.
That is, I don’t think transformers are the best you can do at language acquisition. I suspect humans are doing something better that we don’t understand yet. But transformers are easy to scale up really big, and in ML it’s usually possible for sheer size to compensate for using a suboptimal architecture.
(P.S. Buck says in another thread that humans do poorly when directly asked to do language modeling—which might mean “humans are actually bad at this exact thing,” but I suspect this is due to the unfamiliarity of the task rather than a real limitation of humans. That is, I suspect humans could be trained to perform very well, in the usual sense of “training” for humans where not too much data/time is necessary.)
(3)
This is sort of a semantic issue.
“Scaling” was always a broader concept than just scaling in model size. In this post and the paper, we’re talking about scaling with respect to model size and also with respect to data, and earlier scaling papers were like that too. The two types of scaling look similar in equations.
So “data scale” is a kind of scale, and always has been.
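(For concreteness, here is the kind of parametric form these scaling papers fit, written in my own notation; the papers estimate the constants and exponents from training runs.)

```latex
% Typical parametric scaling-law form (notation mine):
% N = number of parameters, D = number of training tokens.
\[
  L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
% E is the irreducible loss; A, B, alpha, beta are fitted constants.
% Parameter scaling and data scaling enter the equation in the same way.
```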
On the other hand, the original OpenAI/Kaplan scaling paper found kinda the opposite result from this one—model size was what mattered practically, and the data we currently have would be enough for a long time.
People started to conflate “scaling” and “scaling in model size,” because we thought the OpenAI/Kaplan result meant these were the same thing in practice. The way the “scale is all you need” meme gets used, it has this assumption kind of baked in.
There are some things that “scaling enthusiasts” were planning to do that might change in light of this result (if the result is really true), like specialized hardware or software that only helps for very large models. But if we can get much larger-scale data, we may be able to just switch over to a “data scaling world” that, in most respects, looks like the “parameter scaling world” the scaling enthusiasts imagined.
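(A rough sketch of the arithmetic behind that, quoting the commonly cited figures from memory rather than from the paper itself:)

```latex
% Training compute for a transformer is approximately (standard approximation):
\[
  C \approx 6\,N D \qquad \text{($N$ parameters, $D$ training tokens)}
\]
% Kaplan-style advice put most of any extra compute into N. The newer result
% says to grow N and D together, ending up near ~20 tokens per parameter at
% the compute-optimal point, which is why data becomes the binding constraint.
```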
That is, I suspect humans could be trained to perform very well, in the usual sense of “training” for humans where not too much data/time is necessary.

I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-small (about the size of GPT-1).
EDIT: These results are now posted here.
I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would get a much higher score than my 1st in expectation). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though estimated super noisily) but there’s definitely no way I was going to get close to GPT-2.
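(For readers unfamiliar with how human scores get compared to model loss: here is a minimal sketch of one way a probability-assignment game can be scored so the numbers land on the same scale as a model’s per-token loss. This is my own illustration of the idea, not necessarily the exact scoring the actual game used.)

```python
# Minimal sketch (my own assumptions, not necessarily the actual game's rules):
# the player assigns probabilities to candidate next tokens, and the score is
# the negative log probability given to the token that actually came next,
# i.e. the same quantity as a language model's per-token loss.
import math

def per_token_loss(assigned_probs: dict[str, float], actual_next_token: str) -> float:
    """Negative log probability the player gave to the true next token (nats)."""
    p = assigned_probs.get(actual_next_token, 1e-9)  # floor to avoid log(0)
    return -math.log(p)

# Hypothetical round: the prompt is "The cat sat on the", and the player
# spreads probability over a few guesses.
guesses = {"mat": 0.5, "floor": 0.2, "couch": 0.2, "table": 0.1}
print(per_token_loss(guesses, "mat"))   # 0.69 nats: good guess
print(per_token_loss(guesses, "roof"))  # ~20.7 nats: heavily penalized

# Averaging this over many rounds gives a number on the same scale as the
# per-token losses reported for GPT-2-small and friends.
```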
I’m wary of the assumption that we can judge “human ability” on a novel task X by observing performance after an hour of practice.
There are some tasks where performance improves with practice but plateaus within one hour. I’m thinking of relatively easy video games. Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies. But most interesting things that humans “can do” take much longer to learn than this.
Here are some things that humans “can do,” but require >> 1 hour of practice to “do,” while still requiring far less exposure to task-specific example data than we’re used to in ML:
Superforecasting
Reporting calibrated numeric credences, a prerequisite for both superforecasting and the GPT game (does this take >> 1 hour? I would guess so, but I’m not sure)
Playing video/board/card games of nontrivial difficulty or depth
Speaking any given language, even when learned during the critical language acquisition period
Driving motor vehicles like cars (arguably) and planes (definitely)
Writing good prose, for any conventional sense of “good” in any genre/style
Juggling
Computer programming (with any proficiency, and certainly e.g. competitive programming)
Doing homework-style problems in math or physics
Acquiring and applying significant factual knowledge in academic subjects like law or history
The last 3 examples are the same ones Owain_Evans mentioned in another thread, as examples of things LMs can do “pretty well on.”
If we only let the humans practice for an hour, we’ll conclude that humans “cannot do” these tasks at the level of current LMs either, which seems clearly wrong (that is, inconsistent with the common-sense reading of terms like “human performance”).
Ok, sounds like you’re using “not too much data/time” in a different sense than I was thinking of; I suspect we don’t disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult, and plausibly impossible, with any amount of practice.
The human brain internally is performing very similar computations to transformer LLMs—as expected from all the prior research indicating strong similarity between DL vision features and primate vision—but that doesn’t mean we can immediately extract those outputs and apply them towards game performance.
It could be useful to look at the performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level in a foreign language. E.g., you might find GPT-3 is at a level across 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.
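(A rough sketch of how one could run this kind of measurement with an open model; GPT-3 itself can’t be evaluated locally, so the model name below is just a stand-in and the sample texts are placeholders.)

```python
# Rough sketch (my own, not from the comment): compare per-token loss on small
# same-content samples in different languages. Model name and texts are
# placeholders; GPT-3 itself can't be evaluated locally like this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in any open multilingual causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

samples = {
    "en": "The weather was cold, so we stayed inside and read books all day.",
    "es": "Hacía frío, así que nos quedamos dentro y leímos libros todo el día.",
}

for lang, text in samples.items():
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels=input_ids makes the model return the average
        # next-token cross-entropy over the sample
        loss = model(ids, labels=ids).loss.item()
    print(f"{lang}: {loss:.2f} nats/token")
```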
I think I remember seeing somewhere that LLMs learn more slowly on languages with “more complex” grammar (in the sense that their loss decreases more slowly for the same number of training tokens), but I can’t find the source right now.
Based on the language modeling game that Redwood made, it seems like humans are much worse than models at next-word prediction (maybe around the performance of a 12-layer model).
You might be interested in looking at the progress being made on the RWKV-LM architecture, if you aren’t following it. It’s an attempt to train an RNN like a transformer. Initial numbers look pretty good.
A few points:
Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
In some ways, the models’ ability to learn from data is far superior to humans’. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for programming languages. (Humans after their early years can spend years hearing a foreign language every day and learn almost nothing! Most people need to make huge efforts to learn.)
RNNs are much worse than transformers at in-context learning. It’s not just a difference in generative text quality. See this study by DeepMind: https://twitter.com/FelixHill84/status/1524352818261499911
I’m curious where you get that “models trained mostly on English text are still pretty good at Spanish.” Do you have a reference?