I am not confident in that part. I was imagining that they would be “only” 3x bigger or so, but that they’d be trained on much higher-quality data (incl. multimodal) and also trained for longer / on more data, since corps would not be optimizing purely for training-compute-optimal performance but instead worrying a bit more about inference-time compute costs. Most importantly I expect them to be fine-tuned on various things (perhaps you can bundle this under “higher-quality data”). Think of how Codex and Copilot are much better than vanilla GPT-3 at coding. That’s the power of fine-tuning / data quality.
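To make the inference-vs-training-compute point concrete, here’s a back-of-the-envelope sketch using the usual rough FLOP rules of thumb (training ≈ 6·N·D FLOPs for N parameters and D tokens, inference ≈ 2·N FLOPs per generated token). All model sizes and token counts below are invented for illustration, not claims about any actual system:

```python
# Rough FLOP accounting (the standard rules of thumb):
#   training  ~ 6 * N * D        (N params, D training tokens)
#   inference ~ 2 * N per token generated
# All numbers below are illustrative only.

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def inference_flops(n_params, n_tokens_served):
    return 2 * n_params * n_tokens_served

served = 1e12  # pretend we serve a trillion tokens over the model's lifetime

# Two hypothetical models with the SAME training compute: a big one, and a
# ~3x-smaller one trained on ~3x the data.
big   = dict(N=500e9, D=400e9)
small = dict(N=167e9, D=1200e9)

for name, m in [("big", big), ("small", small)]:
    train = training_flops(m["N"], m["D"])
    serve = inference_flops(m["N"], served)
    print(f"{name}: train={train:.2e} FLOPs, serve={serve:.2e} FLOPs")

# Same training bill, but the smaller, longer-trained model costs ~3x less per
# token served, which is why a deployment-minded lab might overtrain a smaller
# model rather than chase training-compute-optimal size.
```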
Also, 3x bigger than GPT-3 is still, like, 40x bigger than Codex, and Codex is pretty impressive. So I expect scale will be contributing some amount to the performance gains for things like code and image and video, albeit not so much for text since GPT-3-175B was already pretty big.
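Spelling out that arithmetic (taking GPT-3 at 175B parameters and Codex at 12B, both figures from the discussion below):

$$3 \times 175\,\text{B} = 525\,\text{B}, \qquad \frac{525\,\text{B}}{12\,\text{B}} \approx 44\times$$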
If Google’s multimodal model is already 100B parameters big, then I look forward to seeing its performance! Is it worse than GPT-3? If so, that would be evidence against my forecast, though we still have two years to go...
Most importantly I expect them to be fine-tuned on various things (perhaps you can bundle this under “higher-quality data”). Think of how Codex and Copilot are much better than vanilla GPT-3 at coding. That’s the power of fine-tuning / data quality.
Fine-tuning GPT-3 on code had little benefit compared to training from scratch:
Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments.
I wouldn’t categorize Codex under “benefits of fine-tuning/data quality” but under “benefits of specialization”. That’s because GPT-3 is trained on little code, whereas Codex is trained only on code. (And the Codex paper didn’t work on data quality more than the GPT-3 paper.)
Huh… I coulda sworn they said Codex was pre-trained on internet text as well as on code, and that it was in particular a version of GPT-3, the 12B-param version...
The paper seems to support this interpretation when you add in more context to the quote you pulled:
We fine-tune GPT models containing up to 12B parameters on code to produce Codex. … Since Codex is evaluated on natural language prompts, we hypothesized that it would be beneficial to fine-tune from the GPT-3 (Brown et al., 2020) model family, which already contains strong natural language representations. Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the fine-tuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments. We train Codex using the same learning rate as the corresponding GPT model, with a 175 step linear warmup and cosine learning rate decay. We train for a total of 100 billion tokens, using the Adam optimizer with β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸, and a weight decay coefficient of 0.1. In order to maximally leverage text representations from GPT, we base our code lexer on the GPT-3 text tokenizer. Since the distribution of words in GitHub code differs from that of natural text, this tokenizer is not very effective for representing code. The largest source of inefficiency arises from encoding whitespace, so we add an additional set of tokens for representing whitespace runs of different lengths. This allows us to represent code using approximately 30% fewer tokens.
Note the bits I bolded. My interpretation is that Codex is indeed a fine-tuned version of GPT-3-12B; the thing they found surprising was that there wasn’t much “transfer learning” from text to code, in the sense that (when they did smaller-scale experiments) models trained from scratch reached the same level of performance. So if models trained from scratch reached the same level of performance, why fine-tune from GPT-3? Answer: Because it converges more quickly that way. Saves compute.
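As an aside, the whitespace-token change at the end of that quote is easy to see the point of. Here’s a toy sketch (made-up regex “tokenizers”, nothing like the real BPE; it just shows why giving runs of spaces their own tokens shrinks indented code):

```python
import re

def tokens_one_space_each(code):
    """Toy baseline: every single space is its own token (a crude stand-in for
    a text tokenizer that handles indentation poorly)."""
    return re.findall(r" |\S+|\n", code)

def tokens_with_space_runs(code):
    """Same toy scheme, but any run of consecutive spaces is a single token
    (the Codex-style whitespace-run trick)."""
    return re.findall(r" +|\S+|\n", code)

snippet = (
    "def relu(x):\n"
    "    if x > 0:\n"
    "        return x\n"
    "    return 0\n"
)

baseline = tokens_one_space_each(snippet)
with_runs = tokens_with_space_runs(snippet)
saving = 1 - len(with_runs) / len(baseline)
print(len(baseline), len(with_runs), f"{saving:.0%} fewer tokens")
```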
Are you surprised? That is precisely what you should expect from the transfer scaling law papers: transfer works as an informative prior saving you a fixed amount of data in the target domain, but informative vs uninformative priors wash out in the limit of enough data—similar to how good prompts are worth a few hundred/thousand finetuning datapoints. If you have limited data in the target domain, transfer can be a huge win; but if you have huge amounts of data, it may be unimportant in terms of final converged performance (albeit potentially important for other reasons like saving compute!).
This is an application where you can scrape huge amounts of code from GitHub and the rest of the Internet (literally terabytes), so it’s unsurprising that you can reach the parity point.
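A toy way to see the washing-out point numerically: treat pre-training transfer as a fixed “effective data” bonus on top of the fine-tuning set and assume a power-law loss. All constants below are invented for illustration; this is not a fit from the actual transfer scaling-law papers:

```python
# Toy illustration, NOT the actual transfer scaling-law fit.
# Treat pre-training as a fixed bonus D_T of "effective" target-domain tokens,
# and assume loss follows a power law in total effective tokens.

def loss(effective_tokens, scale=1e3, exponent=0.08):
    """Hypothetical power-law loss; constants are made up."""
    return scale * effective_tokens ** -exponent

D_T = 5e9  # pretend pre-training on text is "worth" 5B tokens of code (invented)

for D_F in [1e8, 1e9, 1e10, 1e11]:  # fine-tuning tokens, from scarce to huge
    gap = loss(D_F) - loss(D_F + D_T)
    print(f"D_F={D_F:.0e}  from-scratch={loss(D_F):.1f}  "
          f"fine-tuned={loss(D_F + D_T):.1f}  gap={gap:.1f}")

# With ~0.1B tokens of code, the transferred 5B effective tokens dominate;
# at ~100B tokens (roughly the Codex training run) the gap is tiny, i.e.
# the informative prior has washed out.
```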
No I’m not surprised, for exactly the reasons you mention. Had it been the case that Codex was trained from scratch because that was strictly better than fine-tuning, I would have been surprised.
Yes I completely agree. My point is that the fine-tuned version didn’t have better final coding performance than the version trained only on code. I also agree that fine-tuning will probably improve performance on the specific tasks we fine-tune on.
I think also a part of it is: hype doesn’t correlate 100% with the actual capabilities of the latest models. I predict that over the next two years the hype will grow, partly due to capability increases but also partly due to more people interacting with the tech. The “man on the street” still hasn’t heard of GPT-3 or GPT-2 or DALL-E or whatever. I talked to an old friend working at a tech company the other day (he even did some basic ML stuff for his job) and he hadn’t heard of it. Then the hype will probably crash as unrealistic expectations fail to be met. But lol I’m just guessing; I am much less confident in all this than I am in my general views on timelines and takeoff, and I’m not exactly confident in those.