See also “Evaluating Large Language Models Trained on Code”, OpenAI’s contribution. They show progress on the APPS dataset (Introductory: 25% pass, Competition: 3% pass at 1000 samples), though note there was substantial overlap with the training set. They only benchmark models up to 12 billion parameters, but have also trained a related code-optimized model at GPT-3 scale (~100 billion).
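For reference, “X% pass at 1000 samples” means a problem counts as solved if at least one of the generated samples passes its tests; the paper estimates this kind of pass@k metric without bias from n ≥ k samples per problem, of which c are correct. A minimal Python sketch of that estimator (function name and example numbers are mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples is correct,
    given that c of the n generated samples passed, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples, so any set of k samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 37 correct samples out of 1000 gives pass@1 ≈ 0.037
print(pass_at_k(n=1000, c=37, k=1))
```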
Notice that technical details are having a large impact here:
- GPT-3 saw relatively little code, only what was incidentally in its training data, and does poorly.
- GPT-J had GitHub as a substantial fraction of its training set.
- The dataset for Google’s 137-billion-parameter model is not public, but apparently “somewhat oversampled web pages that contain code”. They also try fine-tuning on a very small dataset (374 items).
- Codex takes a pre-trained GPT-3 model and fine-tunes it on 159 GB of code from GitHub, plus some light prompt engineering (a rough sketch of that fine-tuning recipe follows this list). Overall, they show progress on APPS.
- OpenAI’s largest model additionally uses a BPE tokenization optimized for code, and may have other differences. It has not yet been publicly benchmarked.
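For concreteness, the core recipe in the Codex item above (take a pre-trained causal language model and continue training it on a code corpus) looks roughly like the sketch below. This is only an illustration: GPT-2 stands in for GPT-3 (whose weights are not public), and the dataset path, sequence length, and hyperparameters are assumptions, not OpenAI’s actual configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# GPT-2 as a stand-in for a pre-trained GPT-3-style model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical local dump of source files; the real corpus was 159 GB of GitHub code.
raw = load_dataset("text", data_files={"train": "github_code/*.py"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard causal-LM fine-tuning loop; hyperparameters are illustrative only.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="code-finetune-sketch",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```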
Thanks, I probably should have linked to my summary of that paper in this newsletter.