(Concrete, easy-to-answer question below, explanation first)
Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but compared to humans it still takes a lot.
Elsewhere, based on papers like this and this, various people have extrapolated the following takes:
--It seems like bigger neural nets need to see less data to reach the same level of performance.
--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they’ll only need to see each data point once. (Search this for “multiple epochs”)
I feel like these takes are in tension with the common adage. I wonder: If there is a fact mentioned in GPT-3's training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*
I'd be interested to hear people do a bit of a search for the "most sample-efficient/obscure fact" in GPT-3's repertoire. In this manner we could quantify how many times GPT-3 needs to see something before it learns it. (Maybe we don't have access to the dataset used to train GPT-3. But people at Eleuther.ai have The Pile, right? And they've trained big transformers on it? We could answer the question easily and precisely there, no?)
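For what it's worth, here's a minimal sketch of that counting exercise, assuming The Pile is available as a streaming HuggingFace dataset (the dataset identifier and the target string are just illustrative; the full corpus is hundreds of gigabytes, so in practice you'd run this over a shard or in parallel):

```python
# Rough sketch: count how often a candidate "obscure fact" string occurs in The Pile.
# Assumes the EleutherAI/pile dataset is accessible via the HuggingFace hub.
from datasets import load_dataset

target = "Paul Christiano"  # illustrative string whose training-set frequency we want

pile = load_dataset("EleutherAI/pile", split="train", streaming=True)
count = sum(doc["text"].count(target) for doc in pile)

print(f"{target!r} appears {count} times in the corpus")
```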
Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven’t heard of it before.
*Suppose it is ten thousand. Then that means roughly one in every thirty million two-word strings on the internet is "Paul Christiano." (The dataset for GPT-3 was about 300B tokens.) Add in all the other rationalists/EAs and probably it means something like one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.
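Working through that footnote's arithmetic explicitly (all numbers are the rough figures above, not measurements):

```python
occurrences = 10_000   # hypothesized number of times "Paul Christiano" appears
tokens = 300e9         # approximate size of GPT-3's training set in tokens
bigrams = tokens       # overlapping two-word windows, roughly one per token

print(f"1 in {bigrams / occurrences:,.0f} two-word strings")  # ~1 in 30 million

# With, say, a hundred prominent rationalists/EAs at a similar frequency:
names = 100
words_per_name_hit = tokens / (2 * occurrences * names)
print(f"1 in {words_per_name_hit:,.0f} words is such a name")  # ~1 in 150,000
```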
The most relevant paper I know of comes out of data privacy concerns. See Extracting Training Data from Large Language Models, which defines “k-eidetic memorization” as a string that can be elicited by some prompt and appears in at most k documents in the training set. They find several examples of k=1 memorization, though the strings appear repeatedly in the source documents. Unfortunately their methodology is targeted towards high-entropy strings and so is not universal.
I have a related question I've been trying to operationalize. How well do GPT-3's memories "generalize"? In other words, given some fact in the training data, how far out of the source distribution can GPT-3 "gain information" from that fact?
E.g. training: “Ixlthubs live in the water.” Test: does this affect the predicted likelihood of “Ixlthubs live in the Pacific”? What about “Ixlthubs cannot survive on land”? I’d consider this another interesting measure of sample efficiency/generalization performance. I’m attempting to put together a proposal for the BigScience project (some set of synthetic facts to sprinkle throughout the data), but it’s my first try at something like this and slow going.
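As a rough illustration of the measurement I have in mind, here is a sketch that scores probe sentences' log-likelihood under a causal language model (GPT-2 is just a stand-in model; you'd compare these scores for models trained with and without the synthetic fact):

```python
# Minimal sketch: total log-probability a causal LM assigns to a probe sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token; undo the mean.
    n_tokens = enc["input_ids"].shape[1]
    return -out.loss.item() * (n_tokens - 1)

probes = [
    "Ixlthubs live in the water.",       # near the (hypothetical) training sentence
    "Ixlthubs live in the Pacific.",     # mild generalization
    "Ixlthubs cannot survive on land.",  # requires an inference step
]
for p in probes:
    print(f"{p!r}: {sentence_logprob(p):.2f}")
```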
This is great, thanks! Then I wonder what people mean, exactly, when they say current methods are sample-inefficient. k=1 memorization seems to be about as good as humans, and this with tiny artificial neural nets! (Even GPT-3 is a thousand times smaller than a human brain).
Your question is super interesting as well. If you make progress on answering it, I’d love to hear!
First pass at trying to answer:
I’m asking GPT-3 questions of the form “Who is X?” to see what it knows. It knows EY, Paul, Katja, Julia, Wei Dai, Kaj Sotala… It thinks Daniel Kokotajlo is a filmmaker, which is true actually (there are two of us in the world, and the more well-known one is the filmmaker). It thinks Evan Hubinger is a software engineer.
In parallel I'm googling those names in quotes to see how many hits they get. To my surprise there are about ten thousand hits for many of these names; the more popular ones get more. But GPT-3's training data didn't contain the whole internet, right? Just a fraction of it? So presumably it had only one thousand, or one hundred, instances of each name to learn from?
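To make that rough estimate explicit (the crawl fraction here is a made-up placeholder, not a known figure):

```python
google_hits = 10_000    # rough number of search hits for a name
crawl_fraction = 0.01   # hypothetical fraction of those pages in GPT-3's training data

print(f"~{google_hits * crawl_fraction:.0f} training occurrences per name")  # ~100
```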
Slight subtlety—GPT-3 might have a bias in its training data towards things related to AI and things of interest to the internet (maybe they scraped a lot of forums as well as just google). I picked some random names from non-western countries—for example, this Estonian politician gets 33,000 hits on Google and wasn’t recognised by GPT-3. It thought he was a software developer (though from Estonia). Might mean that if you’re estimating sample efficiency from Google search hits on people involved with AI, you’ll end up overestimating sample efficiency.
What did it say about me? :D I think I tried asking the AI Dungeon version about me at some point but apparently the adventure game finetuning had made that knowledge inaccessible.
I don’t remember what it said the first time, but I just asked it now:
Thanks!
LW is sponsored by CFAR so this is kind of correct if you squint a bit
Yeah, I’m counting things as correct if it gets in the right ballpark. Like, I myself didn’t know where you worked exactly, but CFAR sounded plausible, especially as a place you may have worked in the past. The fact that GPT-3 said you work at CFAR means it thinks you are part of the rationalist community, which is pretty impressive IMO.
I think this becomes a lot clearer if we distinguish between total and marginal thinking. GPT-3's total sample efficiency for predicting text is poor:
To learn to predict text, GPT-3 has to read >1000x as much text as a human could read in their lifetime.
To learn to win at Go, AlphaGo has to play >100x as many games as a human could play in their lifetime.
But on the margin, it's very sample-efficient at learning to perform new text-related tasks:
GPT-3 can learn to perform a new text-related task as easily as a human can.
Essentially, what's happened is that GPT-3 is a kind of mega analytical engine that was very sample-inefficient to train up to its current level, but that can now be trained to do additional stuff at relatively little extra cost.
Does that resolve the sense of confusion/mystery, or is there more to it that I’m missing?
That does help, thanks. However, now that I understand better what people are saying, I think it’s wrong:
The comparison they are making is as follows:
However, I think this is a bad comparison, because it ignores everything else in the human's life that the human has learned from / been pre-trained on. A better comparison would be:
In light of this comparison, which I think is more appropriate, it's not even clear that humans are more sample-efficient than GPT-3! On balance it seems like they still probably are, but also note that human brains are 3 OOMs bigger than GPT-3, and we have already established that larger neural nets are more sample-efficient. So… for all we know, the mild human advantage in sample efficiency could be coming mainly from the increased size.
Strong opinions loosely held. I don’t actually trust this reasoning well enough to put my weight on it. I’m just putting it out there to see what people think.
Your comparison does a disservice to the human’s sample efficiency in two ways:
You're counting diverse data in the human's environment, but you're not comparing their performance on diverse tasks. Humans are obviously better than GPT-3 at interactive tasks, walking around, etc. For either kind of fair comparison (text data & task, or diverse data & task), the human has far superior sample efficiency.
“fancy learning techniques” don’t count as data. If the human can get mileage out of them, all the better for the human’s sample efficiency.
So you seem to have it backwards when you say that the comparison that everyone is making is the “bad” one.
Thanks. Hmmm. I agree with #2, and should edit to clarify. I meant “fancy learning techniques that we could also do with our AIs if we wanted,” but maybe I’ll just avoid that can of worms for now.
For #1: We don't know how well a human-sized artificial neural net would perform if it were trained on the quantity and variety of data that humans get. We haven't done the experiment yet. However, my point is that for all we know it's entirely possible that such a neural net would perform at about human level on all the tasks humans do. The people who are saying that modern neural nets are significantly less sample-efficient than humans are committed to denying this. (Or if they aren't, then I don't know what we are arguing about anymore?) They are committed to saying that we can extrapolate from e.g. GPT-3's performance vs. training data to conclude that we'd need something trained a lot longer than a human (on similar-to-human-lifetime data) to reach human performance. One way they might run this argument is to point out that GPT-3 has already seen more text than any human ever has. My reply is that if a human had seen as much text as GPT-3, and only text, nothing else, they probably would have poor performance as well, certainly on every task that wasn't a text-based task! Sorry for this oblique response to your point; if it is insufficient I can make a more direct one.
This paper estimates that the human retina conveys visual information to the rest of the brain at 1e7 bits/second. I haven’t read the paper though. It’s a bit tricky to compare that to pixels anyway, because I think the retina itself does some data compression. I guess we have 6 million cones, which would be ~2M of each type, so maybe vision-at-any-given-time is ballpark comparable to the information content in a 1 megapixel color image??
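A quick sanity check on that comparison, using only the rough figures above:

```python
cones = 6_000_000
cone_types = 3                   # L, M, S cones
per_type = cones // cone_types   # ~2M of each type

pixels = 1_000_000               # a 1 megapixel image
channels = 3                     # R, G, B samples per pixel

print(f"~{per_type:,} cones per type vs {pixels:,} pixels per channel")
# Within a factor of ~2, so "ballpark comparable" seems fair.
```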
OK, nice. Edited to fix.
Perhaps GPT-3 has more parameters than are needed to roughly memorize its very large training data. This would be good, since the data contains some low-quality garbage, false claims, etc. (you can think of these as 'noise'). I believe the GPT-n models are adding parameters faster than training data. Here's my summary of a paper that suggests this is the right move:
https://www.youtube.com/watch?v=OzGguadEHOU Microsoft's Sébastien Bubeck talking about seemingly overparameterized neural models being necessary for learning (due to label noise?). Validation 'early stopping' of training duration or size scaling is a mistake: after you're over some initial hump that would trigger validation early stopping, overfitting is 'benign' [already known, dubbed 'double descent']. As soon as you can defeat adversarial attacks, you're probably using enough parameters. He (+ an intern) proves that in order to memorize the label-noised data set such that small perturbations of the input don't change the predicted output, you need many more parameters than data points (merely memorizing the training data set should be possible with a number of parameters within some constant factor of its size). He predicts that ImageNet (image labeling task) could benefit from 10-100 billion parameters instead of the current sub-1-billion.
(Obviously the GPT-n models are language models, but they can be thought of as having an output, which is the masked word or the sentence-before-or-after or whatever they're using to train.)
You can get an idea of a pre-trained GPT-3's sample efficiency from the GPT-3 fine-tuning API docs. The epoch parameter defaults to 4, and further up in the documentation they recommend fine-tuning with at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although training data is often repetitive (implying maybe 2-10x as many effective epochs?), it learns after seeing the data only a few times. You can see more evidence of sample efficiency going up with scale in Figure 4.1 of this paper. Sample efficiency also goes up with the amount of data already seen (pre-training).
This suggests that at some scale and some amount of pre-training, we may enter the one-shot learning regime. Then there is no need for “long-range” tricks (RNNs, CNNs, attention) anymore. Instead, one can one-shot learn by backprop while doing the predictions within a relatively short time window.
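For concreteness, a sketch of the fine-tuning call from the API docs mentioned above, using the legacy openai Python client (the file ID is a hypothetical placeholder, and the exact interface may have changed since):

```python
import openai

# Create a fine-tune job; n_epochs is the "epoch parameter" discussed above.
openai.FineTune.create(
    training_file="file-abc123",  # hypothetical ID of an uploaded JSONL training file
    model="curie",
    n_epochs=2,  # defaults to 4; the docs suggest 1-2 epochs for chatbot-style data
)
```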
I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 “Curie.”
In my experience, doing more than a single epoch is always harmful when finetuning GPT-J.
I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule. I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect. Training beyond the first epoch only helped on text that had been accidentally duplicated between train and val, and was harmful elsewhere. In other words, it was “helpful” for exact memorization, but harmful for generalization.
I have a wandb report here with some plots of this phenomenon. I’m still not sure whether it’s an indication of the sample efficiency associated with the ~6B scale, a quirk of GPT-J specifically, or (less plausibly) a quirk or bug in the codebase used to tune it.
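Here's a minimal sketch of the kind of per-token inspection described above, comparing two finetuning checkpoints on a validation document (the checkpoint paths are hypothetical, and in practice you'd batch this rather than score one text at a time):

```python
# Compare per-token validation losses between two finetuning checkpoints and flag
# spans that only improved via memorization (e.g. train/val duplicates).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

def per_token_losses(model, text):
    """Negative log-likelihood of each token given its prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so position i predicts token i+1.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    return -logprobs.gather(2, target.unsqueeze(-1)).squeeze(-1)[0]

# Hypothetical checkpoints saved after epoch 1 and epoch 2 of finetuning.
ckpt_epoch1 = AutoModelForCausalLM.from_pretrained("./ckpt-epoch1")
ckpt_epoch2 = AutoModelForCausalLM.from_pretrained("./ckpt-epoch2")

val_text = "..."  # one validation document
delta = per_token_losses(ckpt_epoch1, val_text) - per_token_losses(ckpt_epoch2, val_text)
# Large positive deltas concentrated on specific spans suggest those spans were
# (near-)duplicated in the training set rather than genuinely generalized to.
```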
I did this work before OpenAI released their finetuning feature, and was surprised to find them defaulting to 4 epochs. Especially given that their feature has a relatively tiny maximum dataset size. My gut feeling is that 4 epochs is way too many, given a large model and only 2.5M tokens.
If 4 is not simply a bad default, maybe they were considering data with a higher inferential distance (foreign languages, non-natural/formal languages), which may require more epochs?
I cannot access your wandb, btw. It seems to be private.
Whoops, fixed.
In general, a language model will 'know' the sentence associated with even a single occurrence of a rare name. I don't think you learn much here, as long as there are enough parameters available to support this memory.
The sample-efficiency claim is not a formal one. For example, RL algorithms are said to be sample-inefficient because it takes a human only about ten games of Pac-Man to get good at it, but we can't isolate that knowledge in the human brain. By the time a human learns to play Pac-Man, they have already learned many things, just like GPT-3, and we don't know which of those contribute to playing Pac-Man: is it motor skills? Spatial skills? A fair comparison of sample efficiency would be to identify all the skills that enable a human to learn Pac-Man in only ten games, give those to the RL algorithm as pre-training, and then train it to play Pac-Man. The same applies to the names example: could we really measure how many times a human has heard a name, or maybe a similar name?