Thanks for posting this, it was really interesting. Some very dumb questions from someone who doesn’t understand ML at all:
1. All of the loss numbers in this post “feel” very close together, and close to the minimum loss of 1.69. Does loss only make sense on a very small scale (like from 1.69 to 2.2), or is this telling us that language models are very close to optimal and there are only minimal remaining possible gains? What was the loss of GPT-1?
2. Humans “feel” better than even SOTA language models, but need less training data than those models, even though right now the only way to improve the models is through more training data. What am I supposed to conclude from this? Are humans running on such a different paradigm that none of this matters? Or is it just that humans are better at common-sense language tasks, but worse at token-prediction language tasks, in some way where the tails come apart once language models get good enough?
3. Does this disprove claims that “scale is all you need” for AI, since we’ve already maxed out scale, or are those claims talking about something different?
(1)
Loss values are useful for comparing different models, but I don’t recommend trying to interpret what they “mean” in an absolute sense. There are various reasons for this.
One is that the “conversion rate” between loss differences and ability differences (as judged by humans) changes as the model gets better and the abilities become less trivial.
Early in training, when the model’s progress looks like realizing “huh, the word ‘the’ is more common than some other words”, these simple insights correspond to relatively large decreases in loss. Once the model basically kinda knows English or whatever the language is, it’s already made most of the loss progress it’s going to make, and the further insights we really care about involve much smaller changes in loss. See here for more on this by gwern.
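For a rough sense of what those numbers mean, here’s a minimal sketch (assuming the quoted losses are per-token cross-entropy in nats, as in the Chinchilla paper): exponentiating a loss gives a per-token perplexity, the effective number of equally likely next tokens the model is still choosing among, and on that scale the differences look less tiny.

```python
import math

# Assuming the losses quoted in the post are per-token cross-entropy in nats,
# exp(loss) is the per-token perplexity: the effective number of equally
# likely tokens the model is still choosing among at each step.
for loss in [2.2, 1.9, 1.69]:
    print(f"loss {loss:.2f} nats/token -> perplexity {math.exp(loss):.1f}")

# A drop from 2.2 to 1.69 nats cuts per-token perplexity from ~9.0 to ~5.4,
# a bigger change than the raw loss values might suggest.
```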
(2)
No one really knows, but my money is on “humans are actually better at this through some currently-unknown mechanism,” as opposed to “humans are actually bad at this exact thing.”
Why do I think this?
Well, the reason we’re here talking about this at all is that LMs do write text of spookily high quality, even if they aren’t as good as humans at it. That wasn’t always true. Before the transformer architecture was invented in 2017, LMs were nowhere near this good, and few people besides researchers knew or talked about them.
What changed with the transformer? To some extent, the transformer is really a “smarter” or “better” architecture than the older RNNs. If you do a head-to-head comparison with the same training data, the RNNs do worse.
But also, it’s feasible to scale transformers much bigger than we could scale the RNNs. You don’t see RNNs as big as GPT-2 or GPT-3 simply because it would take too much compute to train them.
So, even though all these models take tons of data to train, we could make the transformers really big and still train them on the tons-of-data they require. And then, because scaling up N really does help, you get a model good enough that you and I are here talking about it.
That is, I don’t think transformers are the best you can do at language acquisition. I suspect humans are doing something better that we don’t understand yet. But transformers are easy to scale up really big, and in ML it’s usually possible for sheer size to compensate for using a suboptimal architecture.
(P.S. Buck says in another thread that humans do poorly when directly asked to do language modeling—which might mean “humans are actually bad at this exact thing,” but I suspect this is due to the unfamiliarity of the task rather than a real limitation of humans. That is, I suspect humans could be trained to perform very well, in the usual sense of “training” for humans where not too much data/time is necessary.)
(3)
This is sort of a semantic issue.
“Scaling” was always a broader concept than just scaling in model size. In this post and the paper, we’re talking about scaling with respect to model size and also with respect to data, and earlier scaling papers were like that too. The two types of scaling look similar in equations.
So “data scale” is a kind of scale, and always has been.
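To make “the two types of scaling look similar in equations” concrete, here’s a small sketch of the parametric form the Chinchilla paper fits. The constants below are the paper’s rounded fits as I remember them, so treat them as approximate:

```python
def chinchilla_loss(N: float, D: float) -> float:
    """Approximate Chinchilla fit: L(N, D) = E + A / N**alpha + B / D**beta.

    N = parameter count, D = training tokens. The point is structural:
    model size and data enter symmetrically, each with its own power law,
    on top of an irreducible term E (the ~1.69 from the post).
    Constants are the paper's rounded fits, quoted from memory.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / N**alpha + B / D**beta

# e.g. roughly Chinchilla itself: 70B parameters, 1.4T tokens
print(chinchilla_loss(70e9, 1.4e12))
```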
On the other hand, the original OpenAI/Kaplan scaling paper found kinda the opposite result from this one—model size was what mattered practically, and the data we currently have would be enough for a long time.
People started to conflate “scaling” and “scaling in model size,” because we thought the OpenAI/Kaplan result meant these were the same thing in practice. The way the “scale is all you need” meme gets used, it has this assumption kind of baked in.
There are some things that “scaling enthusiasts” were planning to do that might change in light of this result (if it really holds up) -- like specialized hardware or software that only helps for very large models. But if we can get much larger-scale data, we may be able to just switch over to a “data scaling world” that, in most respects, looks like the “parameter scaling world” the scaling enthusiasts imagined.
I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1).
EDIT: These results are now posted here.
I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would get a much higher score than my 1st in expectation). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though estimated super noisily) but there’s definitely no way I was going to get close to GPT-2.
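For anyone who hasn’t tried it, here’s a toy sketch (my own reconstruction, not Redwood’s actual implementation) of how a probability-assignment game can be scored so that a human’s play is directly comparable to a model’s loss: you put a probability on the token that actually came next, and your score is the average negative log of that probability.

```python
import math

def score_round(p_true_token: float) -> float:
    """Negative log probability (in nats) assigned to the token that actually
    came next. Lower is better; the average over many rounds is directly
    comparable to a language model's per-token cross-entropy loss."""
    return -math.log(p_true_token)

# Hypothetical probabilities a player assigned to the true next token:
rounds = [0.4, 0.2, 0.55, 0.3]
print(sum(score_round(p) for p in rounds) / len(rounds))  # average nats/token
```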
I’m wary of the assumption that we can judge “human ability” on a novel task X by observing performance after an hour of practice.
There are some tasks where performance improves with practice but plateaus within one hour. I’m thinking of relatively easy video games. Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies. But most interesting things that humans “can do” take much longer to learn than this.
Here are some things that humans “can do,” but require >> 1 hour of practice to “do,” while still requiring far less exposure to task-specific example data than we’re used to in ML:
Superforecasting
Reporting calibrated numeric credences, a prerequisite for both superforecasting and the GPT game (does this take >> 1 hour? I would guess so, but I’m not sure)
Playing video/board/card games of nontrivial difficulty or depth
Speaking any given language, even when learned during the critical language acquisition period
Driving motor vehicles like cars (arguably) and planes (definitely)
Writing good prose, for any conventional sense of “good” in any genre/style
Juggling
Computer programming (with any proficiency, and certainly e.g. competitive programming)
Doing homework-style problems in math or physics
Acquiring and applying significant factual knowledge in academic subjects like law or history
The last 3 examples are the same ones Owain_Evans mentioned in another thread, as examples of things LMs can do “pretty well on.”
If we only let the humans practice for an hour, we’ll conclude that humans “cannot do” these tasks at the level of current LMs either, which seems clearly wrong (that is, inconsistent with the common-sense reading of terms like “human performance”).
Ok, sounds like you’re using “not too much data/time” in a different sense than I was thinking of; I suspect we don’t disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult, and plausibly impossible, with any amount of practice.
The human brain internally is performing very similar computations to transformer LLMs—as expected from all the prior research indicating strong similarity between DL vision features and primate vision—but that doesn’t mean we can immediately extract those outputs and apply them towards game performance.
It could be useful to look at the performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level in a foreign language. E.g. you might find that, across 15 different languages, GPT-3 is at a level that would take a smart human (say) 30 months to reach (2 months per language). Foreign languages are just a small fraction of the training data.
I think I remember seeing somewhere that LLMs learn more slowly on languages with “more complex” grammar (in the sense of their loss decreasing more slowly per the same number of tokens) but I can’t find the source right now.
Based on the language modeling game that Redwood made, it seems like humans are much worse than models at next word prediction (maybe around the performance of a 12-layer model)
You might be interested in looking at the progress being made on the RWKV-LM architecture, if you aren’t following it. It’s an attempt to train an RNN like a transformer. Initial numbers look pretty good.
A few points:
Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
In some ways, the models’ ability to learn from data is far superior to humans’. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for programming languages. (Humans past their early years can spend years hearing a foreign language every day and learn almost nothing! Most people need to make huge efforts to learn.)
RNNs are much worse than transformers at in-context learning. It’s not just a difference in generative text quality. See this study by DeepMind: https://twitter.com/FelixHill84/status/1524352818261499911
I’m curious where you get that “models trained mostly on English text are still pretty good at Spanish.” Do you have a reference?
Why do we say that we need less training data? Every instant of our existence is a multisensory data point, from before we’ve even exited the womb. We spend months, arguably years, hardly capable of anything at all, yet still taking in and retaining data. Unsupervised and mostly redundant, sure, but certainly not less than a curated collection of Internet text. By the time we’re teaching a child to say “dog” for the first time, they’ve probably experienced millions of fragments of data about creatures of various limb quantities, hair and fur types, sizes, sounds, and smells; so they’re already effectively pretrained on animals before we first provide a supervised connection between the sound “dog” and the sight of a four-limbed hairy creature with long ears on a leash.
I believe that by the time we’re adults, humans have taken in multiple orders of magnitude more data than ML models are trained on, even if our data is extremely messy.
I did some calculations with a bunch of assumptions and simplifications; here’s a high-end, back-of-the-envelope estimate of the data and “tokens” a 30-year-old human would have “trained” on:
Visual data: 130 million photoreceptor cells firing at 10 Hz = 1.3 Gbit/s = 162.5 MB/s over 30 years (approx. 946,080,000 seconds) = 153.74 petabytes
Auditory data: humans can hear frequencies up to 20,000 Hz, and high-quality audio is sampled at 44.1 kHz (satisfying the Nyquist-Shannon sampling theorem); at 16 bits (CD quality) x 2 channels (stereo), that’s 1.41 Mbit/s = 0.18 MB/s over 30 years = 0.167 petabytes
Tactile data: 4 million touch receptors providing ~10 bits/s each (assuming they account for temperature, pressure, pain, hair movement, vibration) = 5 MB/s over 30 years = 4.73 petabytes
Olfactory data: we can detect up to 1 trillion smells; assuming we process 1 smell every second and each smell is represented as its own piece of data, i.e. log2(1 trillion) ≈ 40 bits/s = 0.000005 MB/s over 30 years = 0.000005 petabytes
Taste data: 10,000 receptors, assuming a unique identifier for each basic taste (sweet, sour, salty, bitter, umami): log2(5) ≈ 2.3 bits, rounded up to 3 = 30 kbit/s = 0.00375 MB/s over 30 years = 0.0035 petabytes
This amounts to 153.74 + 0.167 + 4.73 + 0.000005 + 0.0035 ≈ 158.64 petabytes; assuming 5 bytes per token (i.e. 5 characters), that comes to about 31,728 trillion tokens.
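Here’s a tiny script that redoes the arithmetic above under the same assumptions (the per-modality bit rates are the rough guesses from the list, not measured values), in case anyone wants to check or tweak it:

```python
SECONDS_30_YEARS = 30 * 365 * 24 * 3600           # approx. 946,080,000 seconds

# Bytes per second for each modality, using the rough assumptions above.
streams = {
    "visual":    130e6 * 10 / 8,      # 130M photoreceptors at 10 Hz, 1 bit per firing
    "auditory":  44_100 * 16 * 2 / 8, # 44.1 kHz, 16-bit, stereo
    "tactile":   4e6 * 10 / 8,        # 4M touch receptors, ~10 bits/s each
    "olfactory": 40 / 8,              # ~log2(1e12) bits, one smell per second
    "taste":     10_000 * 3 / 8,      # 10k receptors, 3 bits each per second
}

total_bytes = sum(streams.values()) * SECONDS_30_YEARS
print(f"{total_bytes / 1e15:.2f} petabytes")            # ~158.6 PB
print(f"{total_bytes / 5 / 1e12:,.0f} trillion tokens") # at 5 bytes per token
```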
This is of course a high estimate, and most of this data is clearly highly compressible, but I wanted a rough sense of the upper bound.
Here’s the google sheet if anyone wants to copy it or contribute
There are about a billion seconds in 30 years. Chinchilla was trained on 1.4 trillion tokens. So for a human adult to have as much data as Chinchilla would require us to process the equivalent of ~1,400 tokens per second. I think that’s something like 2 kilobytes per second.
Inputs to the human brain are probably dominated by vision. I’m not sure how many bytes per second we see, but I don’t think it’s many orders of magnitude higher than 2 kB.
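Spelling out that arithmetic (the bytes-per-second figure depends on what you assume a token is worth in bytes, so I show a couple of assumptions):

```python
chinchilla_tokens = 1.4e12
seconds_30_years = 30 * 365 * 24 * 3600     # roughly a billion seconds

tokens_per_second = chinchilla_tokens / seconds_30_years
print(f"{tokens_per_second:.0f} tokens/s")  # ~1,480 (the ~1,400 above used a round 1e9 seconds)

# The equivalent byte rate depends on how many bytes you call a token:
# ~1.5 bytes/token lands near the ~2 kB/s above; ~4 bytes/token (typical
# for byte-pair-encoded English) would be closer to 6 kB/s.
for bytes_per_token in (1.5, 4):
    print(f"{tokens_per_second * bytes_per_token / 1e3:.1f} kB/s at {bytes_per_token} bytes/token")
```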
That depends a lot on how you count. A quick Google search suggests that the optic nerve has about 1.7 million nerve fibers.
If you assume a neuron firing rate of 20 Hz, that gives you 34 MB per second.
(If 1 firing = 1 bit, that should be 34 megabit ~= 4 megabyte.)
This random article (which I haven’t fact-checked in the least) claims a bandwidth of 8.75 megabit/s ~= 1 megabyte/s. So that’s like 2.5 OOMs higher than the number I claimed for Chinchilla. So yeah, it does seem like humans get more raw data.
(But I still suspect that Chinchilla gets more data if you adjust for (un)interestingness. Where totally random data and easily predictable/compressible data are uninteresting, and data that is hard-but-possible to predict/compress is interesting.)
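Checking the numbers in this sub-thread under the stated assumptions (1 bit per firing, and the article’s figure taken at face value):

```python
import math

fibers = 1.7e6                     # optic nerve fibers, per the quick search above
firing_rate_hz = 20
optic_bits_per_s = fibers * firing_rate_hz
print(optic_bits_per_s / 8 / 1e6, "MB/s")        # ~4.25 MB/s at 1 bit per firing

article_bytes_per_s = 8.75e6 / 8                 # the article's 8.75 Mbit/s ~= 1.1 MB/s
print(math.log10(article_bytes_per_s / 2e3))     # ~2.7 OOMs above ~2 kB/s
```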