You can get an idea of a pre-trained GPT-3's sample efficiency from the GPT-3 fine-tuning API docs. The epoch parameter defaults to 4, and further up in the documentation they recommend fine-tuning with at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although fine-tuning data is often repetitive (implying maybe 2-10x as many effective epochs?), the model learns from seeing the data only a few times. Figure 4.1 in this paper gives more evidence that sample efficiency goes up with scale. Sample efficiency also goes up with the amount of data already seen (pre-training).
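For concreteness, here is roughly what overriding that default looked like with the fine-tunes endpoint of that era (parameter names from memory, and the file ID and API key are placeholders, so treat this as a sketch rather than the exact API):

```python
import openai

openai.api_key = "sk-..."  # placeholder

# Legacy fine-tunes endpoint; n_epochs is the epoch parameter discussed above
# (it defaulted to 4). The training file ID and base model are placeholders.
job = openai.FineTune.create(
    training_file="file-abc123",
    model="curie",
    n_epochs=1,  # the docs suggest 1-2 epochs for conditional tasks with >=500 examples
)
print(job["id"], job["status"])
```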
This suggests that at some scale and some amount of pre-training, we may enter the one-shot learning regime. At that point there would be no need for “long-range” tricks (RNNs, CNNs, attention) anymore; instead, one could learn one-shot via backprop while making predictions within a relatively short time window.
I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 “Curie.”
In my experience, doing more than a single epoch is always harmful when finetuning GPT-J.
I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule. I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect. Training beyond the first epoch only helped on text that had been accidentally duplicated between train and val, and was harmful elsewhere. In other words, it was “helpful” for exact memorization, but harmful for generalization.
I have a wandb report here with some plots of this phenomenon. I’m still not sure whether it’s an indication of the sample efficiency associated with the ~6B scale, a quirk of GPT-J specifically, or (less plausibly) a quirk or bug in the codebase used to tune it.
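If anyone wants to check their own splits for this kind of leakage, here is a minimal sketch of the sort of overlap check that would catch it (hypothetical file names, assuming one JSON object with a "text" field per line; not the exact analysis I ran):

```python
import json

def load_texts(path):
    """Read one JSON object per line and return its 'text' field."""
    with open(path) as f:
        return [json.loads(line)["text"] for line in f]

def char_ngrams(text, n=200, stride=100):
    """Overlapping character spans, used as cheap fingerprints of a document."""
    text = " ".join(text.split())  # normalize whitespace
    return {text[i:i + n] for i in range(0, max(len(text) - n, 1), stride)}

# Fingerprint every training document...
train_grams = set()
for doc in load_texts("train.jsonl"):
    train_grams |= char_ngrams(doc)

# ...then flag validation documents that share long spans with the training set.
for i, doc in enumerate(load_texts("val.jsonl")):
    overlap = char_ngrams(doc) & train_grams
    if overlap:
        print(f"val doc {i}: shares {len(overlap)} 200-char spans with train")
```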
I did this work before OpenAI released their finetuning feature, and was surprised to find them defaulting to 4 epochs, especially given that the feature has a relatively tiny maximum dataset size. My gut feeling is that 4 epochs is way too many, given a large model and only 2.5M tokens.
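As a back-of-the-envelope check: 4 epochs over 2.5M tokens is only ~10M token-presentations, a tiny fraction of the ~300B tokens GPT-3 saw in pre-training, so the concern isn't the amount of extra training but repeating the same small dataset four times to an already very sample-efficient model.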
If 4 is not simply a bad default, maybe they had in mind data at a high inferential distance from the pre-training distribution (foreign languages, non-natural/formal languages), which may require more epochs?
I cannot access your wandb, btw. It seems to be private.
Whoops, fixed.