Are you surprised? That is precisely what you should expect from the transfer scaling law papers: transfer works as an informative prior saving you a fixed amount of data in the target domain, but informative vs uninformative priors wash out in the limit of enough data—similar to how good prompts are worth a few hundred/thousand finetuning datapoints. If you have limited data in the target domain, transfer can be a huge win; but if you have huge amounts of data, it may be unimportant in terms of final converged performance (albeit potentially important for other reasons like saving compute!).
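A minimal sketch of the "washing out" argument, assuming an effective-data-transferred parameterization in the style of the transfer scaling-law papers (constants here are made up for illustration, not fit to anything):

```python
# Toy model: pretraining acts like D_t extra "effective" datapoints in the
# target domain, so loss follows L(D) = E + A / (D + D_t)**alpha.
# All constants below are illustrative assumptions, not measured values.

def loss(D, D_t=0.0, E=1.0, A=50.0, alpha=0.3):
    """Power-law loss with an optional effective-data bonus from transfer."""
    return E + A / (D + D_t) ** alpha

for D in [1e3, 1e5, 1e7, 1e9]:
    scratch = loss(D)             # uninformative prior: no transfer
    transfer = loss(D, D_t=1e6)   # informative prior worth ~1e6 datapoints
    print(f"D={D:.0e}  scratch={scratch:.3f}  transfer={transfer:.3f}  "
          f"gap={scratch - transfer:.3f}")
```

At small D the fixed data bonus dominates and transfer is a large win; by D ≫ D_t the gap is negligible, which is the parity point described above.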
This is an application where you can scrape huge amounts of code from GitHub and the rest of the Internet (literally terabytes), so it’s unsurprising that you can reach the parity point.
No, I’m not surprised, for exactly the reasons you mention. Had it been the case that Codex was trained from scratch because that was strictly better than fine-tuning, I would have been surprised.