See also “Evaluating Large Language Models Trained on Code”, OpenAI’s contribution. They show progress on the APPS dataset (Introductory: 25% pass, Competition: 3% pass at 1000 samples), though note there was substantial overlap with the training set. They only benchmark models up to 12 billion parameters, but have also trained a related code-optimized model at GPT-3 scale (~100 billion).
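For reference, “X% pass at 1000 samples” means a problem counts as solved if at least one of the generated samples passes its tests; the paper estimates this kind of pass@k metric without bias from n ≥ k samples per problem, of which c are correct. A minimal Python sketch of that estimator (function name and example numbers are mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples is correct,
    given that c of the n generated samples passed, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples, so any set of k samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 37 correct samples out of 1000 gives pass@1 ≈ 0.037
print(pass_at_k(n=1000, c=37, k=1))
```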
Notice that technical details are having a large impact here:
- GPT-3 saw relatively little code, only what was incidentally in its training data, and does poorly.
- GPT-J had GitHub as a substantial fraction of its training set.
- The dataset for Google’s 137-billion-parameter model is not public, but apparently “somewhat oversampled web pages that contain code”. They also try fine-tuning on a very small dataset (374 items).
- Codex takes a pre-trained GPT-3 model and fine-tunes it on 159 GB of code from GitHub, plus some light prompt engineering (a rough sketch of that fine-tuning recipe follows this list). Overall, they show progress on APPS.
- OpenAI’s largest model additionally uses a BPE tokenization optimized for code, and may have other differences. It has not yet been publicly benchmarked.
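For concreteness, the core recipe in the Codex item above (take a pre-trained causal language model and continue training it on a code corpus) looks roughly like the sketch below. This is only an illustration: GPT-2 stands in for GPT-3 (whose weights are not public), and the dataset path, sequence length, and hyperparameters are assumptions, not OpenAI’s actual configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# GPT-2 as a stand-in for a pre-trained GPT-3-style model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical local dump of source files; the real corpus was 159 GB of GitHub code.
raw = load_dataset("text", data_files={"train": "github_code/*.py"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard causal-LM fine-tuning loop; hyperparameters are illustrative only.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="code-finetune-sketch",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```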
Thanks, I probably should have linked to my summary of that paper in this newsletter.