Competitive programming with AlphaCode

Link post

Note: I’m totally going to alter the article based on the comments. Though of course I’ll give credit where it is due.

Edit: Apparently there is a post by one of the authors going in depth on the paper. Here you go if you don’t want to read this. He gives more details than the blog, and gives a neat example of one of the questions AlphaCode solves. But he doesn’t give a lot of details on what AlphaCode actually is/​

Predict the headline result. Go on, I dare you.

Deepmind released a new Transformer model which produced code in response to questions from contests on Codeforce. They simulated participation in 10 recent contests, and produced less than 10 potential solutions for each problem.

Any predictions on how likely people thought this was? Submit your answers below[1].

10%20%30%40%50%60%70%80%90%
Would you predict AlphaCode that AlphaCode can outperform the lower quartile of programmers on CodeForce's competitive programming contests?
10%20%30%40%50%60%70%80%90%
Would you predict AlphaCode that AlphaCode can outperform the median programmer on CodeForce's competitive programming contests?
10%20%30%40%50%60%70%80%90%
Would you predict AlphaCode that AlphaCode can outperform the lower three quartiles of programmers on CodeForce's competitive programming contests?

And the actual answer is… they ranked within the top 54% of the contestants[2]. I am personally impressed by this. What’s more impressive is that they substantially outperform SOTA on prior datasets (3% was SOTA on APPS, AlphaCode got 7%). Oh, they also made a new dataset because apparently the old ones sucked (34% of AlphaCode submissiond were correct). And since they built their model and techniques for the new one, they can’t compare the full model to the old, so they had to use a slightly crippled version.

Sidenote: You know what’s cool about Alphacode? They show you all the questions Alphacode attempted, and let you see all the changes it made to its code before it submitted it. Click here for a link. They also link to the paper on that page.

They made a new dataset because the old one had too many false positives?

The model was pre-trained on Github code, and finetuned on CodeContests, which was critical to performance. Each contest had newly made problems. Test/​validation questions were not present in the training set.

A new dataset (called CodeContests) was created for fine-tuning with a lot more hidden test cases than is normal for for competetive programming datasets[3]. Prior models would often pass test cases whilst being incorrect (30% false positive rate)[4]. Incorrect solutions were also included in the dataset. Those questions will be used for value conditiong. The questions were scraped from Codeforce and other programming platforms with meta-data added like problem difficulty, tags for training regarding the algorithm that should be used and data on whether the answer is correct or not so the model can condition its learning on that.

To make the test cases for problems, they mutated inputs from the testcases Codeforce gave them, applied 30 correct solutions to these inputs and chose those outputs which were produced by all solutions. This massively reduced the false positive rate as compared to the raw CodeContests. Whilst it also substantially lowered the rate of slow ||false positive samples, they still remain common (half the solutions are too slow but are still accepted by the tests they made).

Questions were regularly hundreds of words in length[5], with the implementation details left to the agent. Further, the models did not rely on duplicating code from training but focused on the natural language description to create novel solutions. As such, this presents a higher bar than what people seem to require of Codex. Which is basically just completing some trivial functions based off descriptions or completing them.

Their best model solved 34.2% on newly made test questions in their dataset, as compared to the prior state of the art of 1-5% on existing datasets. Which sounds pretty impressive, but is really an apples to oranges comparison as AlphaCide is tested on CodeForces and e.g. Codex is tested on APPS[6]. What’s worse is that AlphaCode can’t use all its new fancy-schmancy methods on the old data-set, so even that doesn’t tell you as much as you’d like. For reference though, the crippled (?) AlphaCode does about twice as well as Codex 12B and hundreds of times better than GPT-Neo 2.7B on APPS.

EDIT: added in to take in Jacob Pfau and Gwern’s comments. The Github code was about 750GB, much smaller than MassiveText (a 10TB corpus) As we can see from Table 7 from the paper[7], pre-training always helps, but GitHub helps a fair bit more than MassiveText across the board. And it implies the marginal value of a code corpus is much higher than the marginal value of a text corpus around about the 750GB mark.

Though one wonders an AlphaCode trained on MassiveText and GitHub could have achieved. As OpenAI showed with Codex, using a language model as an initialisation drastically improves training speed and helps performance. Increasing the size of the code corpus probably would have helped more though, and eventually might make pre-training using natural language corpuses irrelevant. Any sane prior is washed away with enough data after all.

Whilst we’re on the topic, Gwern raised an interesting question: if you trained AlphaCode on GitHub first, then trained it on MassiveText, would it perform better on reasoning tasks? I think this could be interesting to investigate along a natural abstraction line of reasoning. For example, say you have two data-sets for a image classifier. Dataset A gets it to recognise blankspaces, which happens to work well enough on its dataset. Dataset B trains it to recognise curves and colours and shapes. It seems reasonable training on A then B should be worse than training on B then A, as you’re using up some of B’s info to get out of the pit A led you to in the first case, whereas the second just hones its ability to detect certain sorts of shapes without damaging its lower level abstractions. Which also raises the question of whether moving from abstractions A to B is typically harder than moving from random noise to abstraction B.

What IS Alphacode Though?

But what is the actual model you ask? And how much did it help? Good questions! The answer is a little long winded, though you must have a high tolerance for that if you’re reading my post. First, I’m going to describe. No, I’ll just show you a graph tracking changes to the model first so you can see just how many changes they made and how much they helped[8]. Remember, I’m not describing things in chronological order.

EDIT: I said they started off at something state of the art, but I think I misunderstood the paper. Mea culpa. It is a baseline model which is approximately comparable to Codex for most settings and substantially better than GPT-Neo and GPT-J for all comparable settings.

But if you strip out all the changes they introduced in this paper to their prior model, you’d get performance equivalent to that of Codex. Which seems like a pretty big step up, but they did introduced a bunch of changes.

Model Time

A quick overview of the model: they pre-train ala GPT-3, finetune on their CodeContest dataset, produce a large number of answers, filter them down and then cluster them to pick samples for the final 10. Filtering is unique to their setup and greatly aided problem solve rate. Hence their “design decisions were made to facilitate efficient and effective sampling”. Here’s a pretty image from the blog post depicting the model:

Now for a long winded explanation. Initially they trained a transformer LM on Github ala GPT-3 with MLM. Then they fine turned on competitive programming data using GOLD [9] and tempering[10][11] for the training objectives alongside value conditioning. And multi-headed attention was used to reduce sampling costs.

They had a couple of architectural tricks, like using half as many tokens for the encoder as for the decoder as the problem descriptions on average twice as long as the solutions. They also found shallow encoders and deep decoders significantly improved the efficiency of training without hurting performance. I have no clue what’s going on there as this paper argues for deep encoders and shallow decoders for the same reason for machine translation tasks. Anyone care to weigh in? Oh, for tokenization they used a SentencePiece tokenizer to chunk up the code in Github (across all languages) and CodeContests into a compressed format. The encoder and decoder used the same tokenizer.

To reduce to reduce the cost of sampling they used mutli-query attention.[12][13] This is where they re-use a bunch of weights between parts of the network called attention heads (specifically the key and value weights. Watch these videos if you want to know what those are). This lowers memory and cache update costs, which are the main bottleneck during sampling.

As is normal for LMs, they used masking on data sets to augment the dataset. That’s what the MLMFor training they followed the scaling laws of Kaplan et al[14] but skimped out by stopping early on the 41B model. Fine-tuning involved in the graph stands for, presumably next to the deep encoder shallow decoder mystery. As is kind of the point of self supervised learning, MLM was essential for improving the representation learning of the encoder. For training they followed the scaling laws of Kaplan et al[14] but skimped out by stopping early on the 41B model.

EDIT: I misunderstood how they used tempering and am now confused. Apparently they used low T, which was the opposite of what the paper they cited (Dabre and Fujita 2020) did. This makes the training distribution sharper, and hence emphasizes lower confidence predictions during inference and increases variance of the model. Perhaps this helps in conjuction with sampling many answers and filtering them? Though they did not moderately higher T was better for models trained with gold.

Tempering involves altering the final part of the model from “output probability vector” to “output [Logits [answer probability vector]/​T]” i.e. it smooths out the output probabilities. You then add a cross entropy loss term[15] and use the above vector compared to the ground truth. This helps prevents overfitting, can increase gradient update sizes in the late training era and for high T it will train the model to produce answers it is more confident in. Since this technique is useful in low resource settings, did not increase training times much and AlphaCode didn’t have the compute budget to do things optimally as per Gwern, it makes sense that they’d use tempering. And it seems like it worked. Now I’m left wondering if this would have helped GPT-3 since they only did a couple of epochs on their dataset.

Wait, what about the other stuff in the graph?

What, you’re still reading? Well, this is the stuff they didn’t use in the ASSP tests. Remember the incorrect answers included in the CodeContests dataset? Well, they add meta-data to the question in training to tell the model whether the solution was incorrect so it knows how to learn from that. The model was trained to predict whether its submission was correct as well before projecting to logits. This whole method is called value conditioning.[16]

This is somewhat similair to the idea in IRL that an expert demonstrating perfect execution is not as enlightening as a teacher pointing out perfection and pitfalls. This helps the model stress test its abstractions. Looking at the performance graph, this seems like it works pretty well.

GOLD (Generation by Off-policy Learning from Demonstration) was designed to avoid a mismatch between training and evaluation objectives e.g. training for sequence prediction and evaluating answers based off quality. Which also helps prevent output quality from degrading as generation length increases, which is a notorious problem for GPT-3. I think what they do is take MLE (choose the parametrs that make the next token most likely assuming the prior ones were produced by humans) and alter it into something approximating “perceptual utility”[17] [18]which tries to choose parameters which predict the next token a human would given some base policy.

Finally we get to the last two pieces: filtering and clustering. Remember that the model produces many samples, and filters them for correctness on the example tests given in the problem statement. For 10% of problems, no sample is correct on all of the examples. But over all problems, filtering removes 99% of all samples. When you have 10,000-1,000,000 samples that’s not good enough. So they resort to clustering.

Clustering usually involves breaking up a dataset based off some properties that datapoints share. Here, they wanted to cluster programs which are semantically equivalent, so they can pick one of each kind of program and diversify their final 10 solutions as much as possible. Which can be done by if there are test inputs to apply sample solutions to and group them by how they answer the test questions. To make these test cases, a seperate model was trained with the same architecture as the main model to generate test inputs from problem descriptions using example, hidden and generated test inputs as training data. Even invalid test inputs can be useful for the sake of clustering. After clustering, they selected one solution from the largest to smallest cluster as this performed the best. They loop back over to the largest cluster and repeat if they have less than 10 clusters.

Based off the performance/​changes graph, clustering doesn’t really do that much. Though apparently the best model on the validation test was an ensemble of the 41B and 9B model with clustering, and on the test set the 49B model with clustering was just barely the best. Here ensembling means pooling their samples together as far as I can tell. Oh, scaling also seemed to help, but not as consistently as you’d think. The exceptions may be due to statistical noise. Look at the results sections for more.

And that’s what AlphaCode is.

Code Samples and Results

I’m Confused About How Tags Help Code

How about we look at an example problem from CodeContest to see the quality of code and what happens when AlphaCode fails and when it succeeds. We’ll be looking at “A Dislike of Threes”. Here is the failure and here is the success.

In case you’re too lazy to click a button, I’ll describe the problem and show the code here. The problem is that you’re given t integers, and need to output a number in each case. For each integer k, you need to produce the kth integer such that ¬k%3==0||¬k|10==3.

Here are AlphaCode’s two submissions:

A couple of things jump out at first. One, the uncommented code for the successful sample is simpler. Two, there’s a ton of commented out code on the left. Three, the tags between samples differ considerably. Let’s talk about tags first.

One of the changes Deepmind made to the code-contest questions was to give them tags describing what algorithms should be used. Naturally, these can only be present during training as otherwise it gives a bit of an advantage to Alphacode, and besides, other datasets don’t have any tags. So they trained their AI to also be able to predict tags. This is what is used in inference so AlphaCode doesn’t get confused by blank tags.

Examining how tags impacted the code seems useful. We see in “a Dislike of threes”[20] that giving it the tags “implementation, math” produced a simple brute force algorithm. Using “sortings, greedy” seemed to confuse it, with AlphaCode using the wrong datastructures and a nonsensical algorithm. So it seems to me that the tags were important to the success of the code. The paper tested this and found that was true:


But what is bizarre about this is that the “true problem tags” weren’t as useful as just giving any old random tags. You might have thought the reason the tags were useful was because they served to group together the algorithms learned by AlphaCode into higher abstractions and unify its understanding. This data seems to reject that point of view. Instead, it seems more likely that AlphaCode has learnt some unnatural abstractions, like grouping greedy search in some optimisation context with recursion in some combinatoral context and choosing the right tag is like gambling with a lottery ticket.

Onto the commented out code! In this sample, the commented out code does nothing useful. I have no real clue why it is there, or how it is even related to a potential problem solution. This seems to rule out the idea that the program is attempting to write rough code first to guide its thoughts or to mimic human debugging. Forgetting everything else, it comes after all of the correct code! The least terrible guess I have right now is that maybe succeful code in the training data contained comments. Which isn’t convincing.

Now, the commented code somewhat counters what the paper was says about code simplicity, but I think the point still. If you want a correct answer, it will probably be simple. Because the training data should bias it that way. Value conditioning, the thing where AlphaCode tries producing correct or incorrect code based off what you tell it to attempt, should make AlphaCode give simple answers when it tells it to give correct answers as correct answers are probably given by better programmers and they abhor repitition.

AlphaGo vs Older Models and Weird Results

Briefly on the apples to oranges comparison between AlphaCode and older models being unfair, I’d probably say that the things which were left out (value conditioning, tags and ratings) didn’t provide more than a ~10% improvement in CodeContests. Taking into account the much higher false positive rate on APPS, which Codex achieved a roughly 3% success rate on, and the previous improvements I feel like a a fairer comparison between AlphaCode and Codex wouldn’t change the picture a crazy amount. Maybe it would be thrice as capable, instead of twice as we saw by looking at APPS. So whilst I think AlphaCode is impressive, it doesn’t present anything game-changing like AlphaGo did. Onto other parts of the results.

A neat part of this paper is when they investigate how screwing up the problem description impacts the solutions and things go as you’d expect. But something interesting is that re-ordering chunks of the description in a way that shouldn’t change the content of the question doesn’t do much (within 10% change) to the performance for any model size. Given how high the variance is I expect that this could just be noise. So this implies the model has a pretty global understanding of the problem when it is creating code, and is looking for certain content in order to make the code for I/​O, for the algorithm, for data structures etc. Pretty sensible.

Maybe I’m going crazy due to a lack of sleep, but they seem to interpret one of their results in the exact opposite way I do. Look at the graph below.

Do you see that varying the number of solutions to a problem is more useful for a given sample budget than varying the number of problems you have? The other weird results in this paper include a couple of regions where performance falls with model size. I have no idea what to think of these, so I’m going to close my eyes and chant “high variance” to myself.

All their other results make sense: numbers go up, transfomer goes brrr. Stuff like how performance on how many samples pass example tests as you increase model size? It goes up. Increasing solve rate as you increase compute? It goes up. More seriously, solve rates scale log linearly with more samples, with more compute and larger models have higher slopes in the scaling curves when you use optimal model sizes according to Kaplan et al (2020).

Though sometimes the smaller models sometimes outperform the larger models. Though this may be due to high variance in the outputs and the small number of problems for some of the higher difficulty questions. Regardless, increased difficulty always hurts.

However, if you simplify the problem it tends to drastically increase model performance. Telling it the required algorithm greatly improves performance, so it leaves you wondering if it just has little shards of knowledge lying around, connected to only a few contextual clues or phrasings. Otherwise, it is unable to use them and hasn’t really grokked that they can apply everywhere.

Another in interesting thing is that AlphaCode fails badly on a great deal of problems, and increasing compute doesn’t do much to help (though the 41B model’s training was not finished). Observe this spiky graph

We can see that most problems in the validation set have essentially no solutions passing the example tests. That is, they fail at the filter step, not even on the test cases of the problem. And even for the questions where they have a visible fraction of answers passing the filter set, they still seem quite rare. Plus, this distribution looks like a power law and I am vaguely suspicious that this may be the result of some kind of random effect. You know the sort: stuff like probability of fixation of a gene, or the distributions of connections in an Erdos-Rayni graph. I’ll have to think about this a bit more to decide, but I’ll guess there is a maybe 10% chance this is true.

But hey, maybe enough shallow understanding across a wide enough domain just is general intelligence as Kaplan and Hanson seem to argue. And this is still some impressive shallow understanding. Here For more on this topic, see section 5 of Nostalgebraist’s post on language models dissapointing you.

The final important thing in their paper is probably this:

The author’s seem to think that since the model produces so many samples (1024 in this case) it can re-allocate probabilities away from atypical solutions and hence lose some marginal bits for more probability on typical, sensible solutions. This leads to a better solve rate, at least for a while. What I don’t get is why the validation loss keeps increasing past the point where the solve rate seems confident. I thought neural networks didn’t really overfit. What happened to my deep double descent?

Alignment

Now for the most important part. What’s worrying about this paper? Well, it dramatically overcame the prior SOTA, using its smallest model size and without value conditioning or tags/​ratings when comparing it to Codex and GPT-Neo/​J. It leaves some potential improvements on the table e.g. fully training bigger models, pre-train on more data, attatch the AI to a database to query facts about algorithms and and see which seem relevant for the program and so on. Its performance seemed a bit suprising to quite a few of us in the community. So that is evidence for shorter timelines and a lot of people should probably update on that.

On the other hand, it seems like it took a lot of different innovations and tricks that feel like they might not generalise that well (i.e. the filtering and the sampling add little with more compute). Further, there are still a bunch of false positives for the harder questions, and this whole thing feels like more evidence for the Paul-verse than the Eliezer verse.

Even so, I am surprised at how fast we got here and am quite troubled by this. So I guess I feel like we’re another step closer to doom than I thought. Thankfully, my updates are shallow and don’t reach all of my brain so I’ll be able to sleep soundly tonight.[19]

PostScript

I wrote, editted and published this article as I was reading the paper. My reason for doing this is that I have a tendency to write and stop halfway, with dozens of aborted drafts haunting my desktop. Publishing things piecemeal felt easier, and seeing people read and reply to me compelled me to keep writing. So this was good for me in the sense that I tried to do something useful. But was it good for the community? Should this even be allowed? Please vote on whether you think a moderator would say this kind of behaviour is acceptable:

10%20%30%40%50%60%70%80%90%
Will a moderator on Lesswrong tell me I was breaking a norm by writing and editing my report on AlphaCode live?
  1. ^

    I made three questions just so I don’t mess up your predictions through the act of asking.

  2. ^

    Given there were ten contests, this is just the average. If we limit the ratings to users who have participated in at least 1 contest in the last 6 months, AlphaCode is ranked (predict it first!) in the top 28%. The gap is due to top competitors competing more regularly, and so the pool of programmers in the latter case is more homogenous, resulting in a better Elo than average ranking.

  3. ^

    Solutions were evaluated for behaviour on test cases, edge cases as well as execution speed.

  4. ^

    Which will likely increase capabilities. Thanks Deepmind!

  5. ^

    They added metadata to questions, like tags for which approach should be used (“dp” or “greedy). These weren’t present in the test sets.

  6. ^

    Editted this thanks to Daniel Kotajenko for pointing out I was unclear.

  7. ^

    The digit after the @ is how many samples the model generates, and the 10 refers to the number of samples after the filter. This is not a green friendly model, though Nvidia may beg to differ.

  8. ^

    As an aside, this graph feels more Paul-verse than Eliezer-verse, and maybe more Hanson verse than Paul-verse?

  9. ^

    Pang and He, 2020, “TEXT GENERATION BY LEARNING FROM DEMONSTRA

    TIONS”

  10. ^
  11. ^

    I love how metal that sentence sounds.

  12. ^
  13. ^

    What’s going on with Deepmind? Are they running out of money to run ML experiments? Is Google strangling their funding? Was this team made up of social pariahs because they’re increasing the AI doom-ometer? Are we getting off the ever climbing compute rocket?

  14. ^
  15. ^

    Only for human labelled examples, which are called “gold” labels in self supervised learning.

  16. ^

    Naturally during inference they needed to tell the model to condition on the sample being correct.

  17. ^

    Huszar, 2015, “How (not) to train your model: Scheduled Sampling, Likelihood, Adversary?

    Notably, they claim that MLE + adversarial training leads you towards their training objective

  18. ^
  19. ^

    Wait. Uh oh.

  20. ^

    The paper analysed the problem “Gregor and Cryptography”, they found the same thing. Using “number theory” improve correctness rates by a factor of three and achieved minimal complexity a factor of four times more often than using the “brute force” as the tag.