I think this becomes a lot clearer if we distinguish between total and marginal thinking. GPT-3's total sample efficiency for predicting text is poor:
To learn to predict text, GPT-3 has to read >1000x as much text as a human can read in their lifetime.
To learn to win at Go, AlphaGo has to play >100x as many games as a human could play in their lifetime.
But on the margin, it's very sample-efficient at learning to perform new text-related tasks:
GPT-3 can learn to perform a new text-related task as easily as a human can.
Essentially, what’s happened is GPT-3 is a kind-of mega-analytical-engine that was really sample inefficient to train up to its current level, but that can now be trained to do additional stuff at relatively little extra cost.
Does that resolve the sense of confusion/mystery, or is there more to it that I’m missing?
That does help, thanks. However, now that I understand better what people are saying, I think it’s wrong:
The comparison they are making is as follows:
| GPT-3 | Human |
| --- | --- |
| Pre-trained on 3x10^11 tokens of text | Pre-trained on 3x10^8 tokens of text (Fermi estimate based on 300 WPM, so maybe 500 tokens per minute, 10 hours per week of reading, 52 weeks a year, over 20 years of life) |
| Able to read a new fact once or twice and then learn it / remember it | Able to read a new fact once or twice and then learn it / remember it |
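For anyone who wants to check the arithmetic behind the human column, here is a minimal Python sketch of the Fermi estimate. The inputs are the figures quoted in the comparison above; the constant and function names are only illustrative, and the whole thing is a back-of-the-envelope calculation rather than a measurement.

```python
# Fermi estimate of how much text a 20-year-old has read,
# using the figures quoted in the comparison above.
TOKENS_PER_MINUTE = 500   # ~300 words per minute, rounded to tokens
HOURS_PER_WEEK = 10       # assumed weekly reading time
WEEKS_PER_YEAR = 52
YEARS = 20
GPT3_TOKENS = 3e11        # GPT-3 pre-training tokens, as quoted


def human_text_tokens() -> int:
    """Tokens of text read over the assumed 20 years."""
    minutes_per_year = HOURS_PER_WEEK * 60 * WEEKS_PER_YEAR
    return TOKENS_PER_MINUTE * minutes_per_year * YEARS


human_tokens = human_text_tokens()
ratio = GPT3_TOKENS / human_tokens

print(f"human: {human_tokens:.2e} tokens")
print(f"GPT-3 / human: {ratio:.0f}x")
```

Under these inputs the human total comes out near 3x10^8 tokens, and GPT-3's text diet is roughly three orders of magnitude larger.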
However, I think this is a bad comparison, because it ignores everything else in the human life that the human has learned from / been pre-trained on. A better comparison would be:
| GPT-3 | Human |
| --- | --- |
| Pre-trained on 3x10^11 tokens of text | Pre-trained on 3x10^8 tokens of text, as well as 1.5x10^8 tokens of spoken language and 10^17 pixels of video (conservative estimate assuming about 10 frames per second over a 20-year lifespan, using an at-a-glance pixel count of human vision. ETA: Steve says it's an order of magnitude less than that if you look at the compressed information the retina sends to the brain, so 10^16). To put it another way: a 20-year-old human has roughly 10^10 tenths-of-a-second of experience to learn from. |
| Able to read a new fact once or twice and then learn it / remember it | Able to read a new fact once or twice and then learn it / remember it |
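The "tenths-of-a-second" figure can be sanity-checked with a short Python sketch. It assumes 365-day years and no time off for sleep, which are my simplifications and not part of the original estimate; the per-frame pixel count is not stated in the comparison, so the sketch only backs out what the quoted 10^17 total would imply.

```python
# Rough count of visual "frames" in 20 years of life, and the
# per-frame pixel count implied by the 10^17 total quoted above.
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: 365-day years, no sleep
YEARS = 20
FRAMES_PER_SECOND = 10               # one frame per tenth of a second
TOTAL_PIXELS_QUOTED = 1e17           # figure quoted in the comparison

frames = SECONDS_PER_YEAR * YEARS * FRAMES_PER_SECOND
implied_pixels_per_frame = TOTAL_PIXELS_QUOTED / frames

print(f"frames: {frames:.2e}")
print(f"implied pixels per frame: {implied_pixels_per_frame:.2e}")
```

This gives about 6x10^9 frames, which rounds to the 10^10 quoted, and an implied frame size on the order of 10^7 pixels.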
In light of this comparison, which I think is more appropriate, it's not even clear that humans are more sample-efficient than GPT-3! On balance it seems like they still probably are, but also note that human brains are about 3 OOMs bigger than GPT-3, and we have already established that larger neural nets are more sample-efficient. So… for all we know, the mild human advantage in sample efficiency could be mainly coming from the increased size.
Strong opinions loosely held. I don’t actually trust this reasoning well enough to put my weight on it. I’m just putting it out there to see what people think.
Your comparison does a disservice to the human’s sample efficiency in two ways:
1. You're counting diverse data in the human's environment, but you're not comparing their performance on diverse tasks. Humans are obviously better than GPT-3 at interactive tasks, walking around, etc. For either kind of fair comparison (text data and text tasks, or diverse data and diverse tasks), the human has far superior sample efficiency.
2. "Fancy learning techniques" don't count as data. If the human can get mileage out of them, all the better for the human's sample efficiency.
So you seem to have it backwards when you say that the comparison that everyone is making is the “bad” one.
Thanks. Hmmm. I agree with #2, and should edit to clarify. I meant “fancy learning techniques that we could also do with our AIs if we wanted,” but maybe I’ll just avoid that can of worms for now.
For #1: We don't know how well a human-sized artificial neural net would perform if it were trained on the quantity and variety of data that humans have. We haven't done the experiment yet. However, my point is that for all we know it's entirely possible that such a neural net would perform at about human level on all the tasks humans do. The people who are saying that modern neural nets are significantly less sample-efficient than humans are committed to denying this. (Or if they aren't, then I don't know what we are arguing about anymore?) They are committed to saying that we can extrapolate from e.g. GPT-3's performance vs. training data to conclude that we'd need something trained a lot longer than a human (on similar-to-human-lifetime data) to reach human performance. One way they might run this argument is to point out that GPT-3 has already seen more text than any human ever. My reply is that if a human had seen as much text as GPT-3, and only text, nothing else, they would probably have poor performance as well, certainly on every task that wasn't a text-based task! Sorry for this oblique response to your point; if it is insufficient I can make a more direct one.
This paper estimates that the human retina conveys visual information to the rest of the brain at 1e7 bits/second. I haven’t read the paper though. It’s a bit tricky to compare that to pixels anyway, because I think the retina itself does some data compression. I guess we have 6 million cones, which would be ~2M of each type, so maybe vision-at-any-given-time is ballpark comparable to the information content in a 1 megapixel color image??
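As a rough check on how the 1e7 bits/second figure scales to a lifetime, here is a minimal Python sketch. It assumes 365-day years with eyes open the whole time, and it takes the "~2M of each type" split at face value, which is a simplification; none of the variable names come from the paper.

```python
# Lifetime visual information at the quoted retinal bandwidth,
# plus the cone-count comparison to a 1-megapixel color image.
RETINA_BITS_PER_SECOND = 1e7         # figure quoted from the paper
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: 365-day years, no sleep
YEARS = 20
CONES = 6e6                          # cone count quoted above

lifetime_bits = RETINA_BITS_PER_SECOND * SECONDS_PER_YEAR * YEARS
cones_per_type = CONES / 3           # assumes an even split across 3 types
megapixel_samples = 1e6 * 3          # 1 megapixel x 3 color channels

print(f"lifetime bits: {lifetime_bits:.2e}")
print(f"cones per type: {cones_per_type:.0e}")
print(f"cones / megapixel samples: {CONES / megapixel_samples:.1f}")
```

That puts 20 years of retinal output at around 6x10^15 bits, i.e. on the order of 10^16, and the cone count within a factor of two of the samples in a 1-megapixel color image.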
OK, nice. Edited to fix.