Huh.… I coulda’ sworn they said Codex was pre-trained on internet text as well as on code, and that it was in particular a version of GPT-3, the 12B param version...
The paper seems to support this interpretation when you add in more context to the quote you pulled:
We fine-tune GPT models containing up to 12B parameters on code to produce Codex. … Since Codex is evaluated on natural language prompts, we hypothesized that it would be beneficial to fine-tune from the GPT-3 (Brown et al., 2020) model family, which already contains strong natural language representations. Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the fine-tuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments. We train Codex using the same learning rate as the corresponding GPT model, with a 175 step linear warmup and cosine learning rate decay. We train for a total of 100 billion tokens, using the Adam optimizer with β1 = 0.9, β2 = 0.95, ε = 10^-8, and a weight decay coefficient of 0.1. In order to maximally leverage text representations from GPT, we base our code lexer on the GPT-3 text tokenizer. Since the distribution of words in GitHub code differs from that of natural text, this tokenizer is not very effective for representing code. The largest source of inefficiency arises from encoding whitespace, so we add an additional set of tokens for representing whitespace runs of different lengths. This allows us to represent code using approximately 30% fewer tokens.
Note the bits I bolded. My interpretation is that Codex is indeed a fine-tuned version of GPT-3-12B; the thing they found surprising was that there wasn’t much “transfer learning” from text to code, in the sense that (when they did smaller-scale experiments) models trained from scratch reached the same level of performance. So if models trained from scratch reached the same level of performance, why fine-tune from GPT-3? Answer: Because it converges more quickly that way. Saves compute.
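As an aside on the tokenizer bit at the end of that quote: here is a toy sketch (not the GPT-3 BPE tokenizer or the actual Codex lexer; the regexes and the snippet below are made up purely for illustration) of why adding dedicated tokens for whitespace runs shrinks the token count of indented code:

```python
import re

# Toy illustration, NOT the actual Codex lexer: compare a tokenizer that
# emits one token per whitespace character against one that collapses a
# run of consecutive spaces into a single token, in the spirit of the
# "additional set of tokens for representing whitespace runs" change.

code = (
    "def f(x):\n"
    "    if x > 0:\n"
    "        return x\n"
    "    return -x\n"
)

def baseline_tokens(text):
    # crude stand-in for a generic text tokenizer: words, single punctuation
    # marks, and every whitespace character (each indentation space) separately
    return re.findall(r"\w+|[^\w\s]|\s", text)

def run_aware_tokens(text):
    # same, except " +" turns a whole run of spaces into one token
    return re.findall(r"\w+|[^\w\s]| +|\s", text)

print(len(baseline_tokens(code)), len(run_aware_tokens(code)))
# prints "42 29" for this snippet, i.e. roughly 30% fewer tokens; the real
# saving in the paper comes from BPE over GitHub code, not this toy regex.
```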
Are you surprised? That is precisely what you should expect from the transfer scaling law papers: transfer works as an informative prior saving you a fixed amount of data in the target domain, but informative vs. uninformative priors wash out in the limit of enough data—similar to how good prompts are worth a few hundred/thousand finetuning datapoints. If you have limited data in the target domain, transfer can be a huge win; but if you have huge amounts of data, it may be unimportant in terms of final converged performance (albeit potentially important for other reasons like saving compute!).
This is an application where you can scrape huge amounts of code from GitHub and the rest of the Internet (literally terabytes), so it’s unsurprising that you can reach the parity point.
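To put toy numbers on the wash-out point (the constants below are made up for illustration, not the fitted coefficients from the transfer scaling law work): model transfer from text pre-training as an effective-data bonus that grows sublinearly in the size of the code fine-tuning set, and its share of the total collapses as that set gets huge.

```python
# Toy sketch of "informative priors wash out in the limit of enough data".
# k and alpha are made-up illustrative constants, not fitted values.

def effective_data(d_finetune, k=1e3, alpha=0.4):
    """Return (total effective data, transfer bonus) for a fine-tuning set.

    The bonus k * d_finetune**alpha is a hypothetical sublinear form: with
    alpha < 1 it is a big help when d_finetune is small and a rounding
    error when d_finetune is huge.
    """
    d_transferred = k * d_finetune ** alpha
    return d_finetune + d_transferred, d_transferred

for d_f in [1e4, 1e6, 1e8, 1e11]:  # tokens of code available for fine-tuning
    total, bonus = effective_data(d_f)
    print(f"D_F={d_f:.0e}  transfer bonus={bonus:.2e}  share={bonus / total:.1%}")
```

With these made-up numbers the bonus is most of your effective data at 10^4 tokens of code but a fraction of a percent by 10^11, which is the regime a terabyte-scale scrape puts you in.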
No I’m not surprised, for exactly the reasons you mention. Had it been the case that Codex was trained from scratch because that was strictly better than fine-tuning, I would have been surprised.
Yes I completely agree. My point is that the fine-tuned version didn’t have better final coding performance than the version trained only on code. I also agree that fine-tuning will probably improve performance on the specific tasks we fine-tune on.