Note: I’m totally going to alter the article based on the comments. Though of course I’ll give credit where it is due.
Edit: Apparently there is a post by one of the authors going in depth on the paper. Here you go if you don’t want to read this. He gives more details than the blog, and gives a neat example of one of the questions AlphaCode solves. But he doesn’t give a lot of details on what AlphaCode actually is/
Predict the headline result. Go on, I dare you.
Deepmind released a new Transformer model which produced code in response to questions from contests on Codeforce. They simulated participation in 10 recent contests, and produced less than 10 potential solutions for each problem.
Any predictions on how likely people thought this was? Submit your answers below[1].
And the actual answer is… they ranked within the top 54% of the contestants[2]. I am personally impressed by this. What’s more impressive is that they substantially outperform SOTA on prior datasets (3% was SOTA on APPS, AlphaCode got 7%). Oh, they also made a new dataset because apparently the old ones sucked (34% of AlphaCode submissiond were correct). And since they built their model and techniques for the new one, they can’t compare the full model to the old, so they had to use a slightly crippled version.
Sidenote: You know what’s cool about Alphacode? They show you all the questions Alphacode attempted, and let you see all the changes it made to its code before it submitted it. Click here for a link. They also link to the paper on that page.
They made a new dataset because the old one had too many false positives?
The model was pre-trained on Github code, and finetuned on CodeContests, which was critical to performance. Each contest had newly made problems. Test/validation questions were not present in the training set.
A new dataset (called CodeContests) was created for fine-tuning with a lot more hidden test cases than is normal for for competetive programming datasets[3]. Prior models would often pass test cases whilst being incorrect (30% false positive rate)[4]. Incorrect solutions were also included in the dataset. Those questions will be used for value conditiong. The questions were scraped from Codeforce and other programming platforms with meta-data added like problem difficulty, tags for training regarding the algorithm that should be used and data on whether the answer is correct or not so the model can condition its learning on that.
To make the test cases for problems, they mutated inputs from the testcases Codeforce gave them, applied 30 correct solutions to these inputs and chose those outputs which were produced by all solutions. This massively reduced the false positive rate as compared to the raw CodeContests. Whilst it also substantially lowered the rate of slow ||false positive samples, they still remain common (half the solutions are too slow but are still accepted by the tests they made).
Questions were regularly hundreds of words in length[5], with the implementation details left to the agent. Further, the models did not rely on duplicating code from training but focused on the natural language description to create novel solutions. As such, this presents a higher bar than what people seem to require of Codex. Which is basically just completing some trivial functions based off descriptions or completing them.
Their best model solved 34.2% on newly made test questions in their dataset, as compared to the prior state of the art of 1-5% on existing datasets. Which sounds pretty impressive, but is really an apples to oranges comparison as AlphaCide is tested on CodeForces and e.g. Codex is tested on APPS[6]. What’s worse is that AlphaCode can’t use all its new fancy-schmancy methods on the old data-set, so even that doesn’t tell you as much as you’d like. For reference though, the crippled (?) AlphaCode does about twice as well as Codex 12B and hundreds of times better than GPT-Neo 2.7B on APPS.
EDIT: added in to take in Jacob Pfau and Gwern’s comments. The Github code was about 750GB, much smaller than MassiveText (a 10TB corpus) As we can see from Table 7 from the paper[7], pre-training always helps, but GitHub helps a fair bit more than MassiveText across the board. And it implies the marginal value of a code corpus is much higher than the marginal value of a text corpus around about the 750GB mark.
Though one wonders an AlphaCode trained on MassiveText and GitHub could have achieved. As OpenAI showed with Codex, using a language model as an initialisation drastically improves training speed and helps performance. Increasing the size of the code corpus probably would have helped more though, and eventually might make pre-training using natural language corpuses irrelevant. Any sane prior is washed away with enough data after all.
Whilst we’re on the topic, Gwern raised an interesting question: if you trained AlphaCode on GitHub first, then trained it on MassiveText, would it perform better on reasoning tasks? I think this could be interesting to investigate along a natural abstraction line of reasoning. For example, say you have two data-sets for a image classifier. Dataset A gets it to recognise blankspaces, which happens to work well enough on its dataset. Dataset B trains it to recognise curves and colours and shapes. It seems reasonable training on A then B should be worse than training on B then A, as you’re using up some of B’s info to get out of the pit A led you to in the first case, whereas the second just hones its ability to detect certain sorts of shapes without damaging its lower level abstractions. Which also raises the question of whether moving from abstractions A to B is typically harder than moving from random noise to abstraction B.
What IS Alphacode Though?
But what is the actual model you ask? And how much did it help? Good questions! The answer is a little long winded, though you must have a high tolerance for that if you’re reading my post. First, I’m going to describe. No, I’ll just show you a graph tracking changes to the model first so you can see just how many changes they made and how much they helped[8]. Remember, I’m not describing things in chronological order.
EDIT: I said they started off at something state of the art, but I think I misunderstood the paper. Mea culpa. It is a baseline model which is approximately comparable to Codex for most settings and substantially better than GPT-Neo and GPT-J for all comparable settings.
But if you strip out all the changes they introduced in this paper to their prior model, you’d get performance equivalent to that of Codex. Which seems like a pretty big step up, but they did introduced a bunch of changes.
Model Time
A quick overview of the model: they pre-train ala GPT-3, finetune on their CodeContest dataset, produce a large number of answers, filter them down and then cluster them to pick samples for the final 10. Filtering is unique to their setup and greatly aided problem solve rate. Hence their “design decisions were made to facilitate efficient and effective sampling”. Here’s a pretty image from the blog post depicting the model:
Now for a long winded explanation. Initially they trained a transformer LM on Github ala GPT-3 with MLM. Then they fine turned on competitive programming data using GOLD [9] and tempering[10][11] for the training objectives alongside value conditioning. And multi-headed attention was used to reduce sampling costs.
They had a couple of architectural tricks, like using half as many tokens for the encoder as for the decoder as the problem descriptions on average twice as long as the solutions. They also found shallow encoders and deep decoders significantly improved the efficiency of training without hurting performance. I have no clue what’s going on there as this paper argues for deep encoders and shallow decoders for the same reason for machine translation tasks. Anyone care to weigh in? Oh, for tokenization they used a SentencePiece tokenizer to chunk up the code in Github (across all languages) and CodeContests into a compressed format. The encoder and decoder used the same tokenizer.
To reduce to reduce the cost of sampling they used mutli-query attention.[12][13] This is where they re-use a bunch of weights between parts of the network called attention heads (specifically the key and value weights. Watch these videos if you want to know what those are). This lowers memory and cache update costs, which are the main bottleneck during sampling.
As is normal for LMs, they used masking on data sets to augment the dataset. That’s what the MLMFor training they followed the scaling laws of Kaplan et al[14] but skimped out by stopping early on the 41B model. Fine-tuning involved in the graph stands for, presumably next to the deep encoder shallow decoder mystery. As is kind of the point of self supervised learning, MLM was essential for improving the representation learning of the encoder. For training they followed the scaling laws of Kaplan et al[14] but skimped out by stopping early on the 41B model.
EDIT: I misunderstood how they used tempering and am now confused. Apparently they used low T, which was the opposite of what the paper they cited (Dabre and Fujita 2020) did. This makes the training distribution sharper, and hence emphasizes lower confidence predictions during inference and increases variance of the model. Perhaps this helps in conjuction with sampling many answers and filtering them? Though they did not moderately higher T was better for models trained with gold.
Tempering involves altering the final part of the model from “output probability vector” to “output [Logits [answer probability vector]/T]” i.e. it smooths out the output probabilities. You then add a cross entropy loss term[15] and use the above vector compared to the ground truth. This helps prevents overfitting, can increase gradient update sizes in the late training era and for high T it will train the model to produce answers it is more confident in. Since this technique is useful in low resource settings, did not increase training times much and AlphaCode didn’t have the compute budget to do things optimally as per Gwern, it makes sense that they’d use tempering. And it seems like it worked. Now I’m left wondering if this would have helped GPT-3 since they only did a couple of epochs on their dataset.
Wait, what about the other stuff in the graph?
What, you’re still reading? Well, this is the stuff they didn’t use in the ASSP tests. Remember the incorrect answers included in the CodeContests dataset? Well, they add meta-data to the question in training to tell the model whether the solution was incorrect so it knows how to learn from that. The model was trained to predict whether its submission was correct as well before projecting to logits. This whole method is called value conditioning.[16]
This is somewhat similair to the idea in IRL that an expert demonstrating perfect execution is not as enlightening as a teacher pointing out perfection and pitfalls. This helps the model stress test its abstractions. Looking at the performance graph, this seems like it works pretty well.
GOLD (Generation by Off-policy Learning from Demonstration) was designed to avoid a mismatch between training and evaluation objectives e.g. training for sequence prediction and evaluating answers based off quality. Which also helps prevent output quality from degrading as generation length increases, which is a notorious problem for GPT-3. I think what they do is take MLE (choose the parametrs that make the next token most likely assuming the prior ones were produced by humans) and alter it into something approximating “perceptual utility”[17] [18]which tries to choose parameters which predict the next token a human would given some base policy.
Finally we get to the last two pieces: filtering and clustering. Remember that the model produces many samples, and filters them for correctness on the example tests given in the problem statement. For 10% of problems, no sample is correct on all of the examples. But over all problems, filtering removes 99% of all samples. When you have 10,000-1,000,000 samples that’s not good enough. So they resort to clustering.
Clustering usually involves breaking up a dataset based off some properties that datapoints share. Here, they wanted to cluster programs which are semantically equivalent, so they can pick one of each kind of program and diversify their final 10 solutions as much as possible. Which can be done by if there are test inputs to apply sample solutions to and group them by how they answer the test questions. To make these test cases, a seperate model was trained with the same architecture as the main model to generate test inputs from problem descriptions using example, hidden and generated test inputs as training data. Even invalid test inputs can be useful for the sake of clustering. After clustering, they selected one solution from the largest to smallest cluster as this performed the best. They loop back over to the largest cluster and repeat if they have less than 10 clusters.
Based off the performance/changes graph, clustering doesn’t really do that much. Though apparently the best model on the validation test was an ensemble of the 41B and 9B model with clustering, and on the test set the 49B model with clustering was just barely the best. Here ensembling means pooling their samples together as far as I can tell. Oh, scaling also seemed to help, but not as consistently as you’d think. The exceptions may be due to statistical noise. Look at the results sections for more.
And that’s what AlphaCode is.
Code Samples and Results
I’m Confused About How Tags Help Code
How about we look at an example problem from CodeContest to see the quality of code and what happens when AlphaCode fails and when it succeeds. We’ll be looking at “A Dislike of Threes”. Here is the failure and here is the success.
In case you’re too lazy to click a button, I’ll describe the problem and show the code here. The problem is that you’re given t integers, and need to output a number in each case. For each integer k, you need to produce the kth integer such that ¬k%3==0||¬k|10==3.
Here are AlphaCode’s two submissions:
A couple of things jump out at first. One, the uncommented code for the successful sample is simpler. Two, there’s a ton of commented out code on the left. Three, the tags between samples differ considerably. Let’s talk about tags first.
One of the changes Deepmind made to the code-contest questions was to give them tags describing what algorithms should be used. Naturally, these can only be present during training as otherwise it gives a bit of an advantage to Alphacode, and besides, other datasets don’t have any tags. So they trained their AI to also be able to predict tags. This is what is used in inference so AlphaCode doesn’t get confused by blank tags.
Examining how tags impacted the code seems useful. We see in “a Dislike of threes”[20] that giving it the tags “implementation, math” produced a simple brute force algorithm. Using “sortings, greedy” seemed to confuse it, with AlphaCode using the wrong datastructures and a nonsensical algorithm. So it seems to me that the tags were important to the success of the code. The paper tested this and found that was true:
But what is bizarre about this is that the “true problem tags” weren’t as useful as just giving any old random tags. You might have thought the reason the tags were useful was because they served to group together the algorithms learned by AlphaCode into higher abstractions and unify its understanding. This data seems to reject that point of view. Instead, it seems more likely that AlphaCode has learnt some unnatural abstractions, like grouping greedy search in some optimisation context with recursion in some combinatoral context and choosing the right tag is like gambling with a lottery ticket.
Onto the commented out code! In this sample, the commented out code does nothing useful. I have no real clue why it is there, or how it is even related to a potential problem solution. This seems to rule out the idea that the program is attempting to write rough code first to guide its thoughts or to mimic human debugging. Forgetting everything else, it comes after all of the correct code! The least terrible guess I have right now is that maybe succeful code in the training data contained comments. Which isn’t convincing.
Now, the commented code somewhat counters what the paper was says about code simplicity, but I think the point still. If you want a correct answer, it will probably be simple. Because the training data should bias it that way. Value conditioning, the thing where AlphaCode tries producing correct or incorrect code based off what you tell it to attempt, should make AlphaCode give simple answers when it tells it to give correct answers as correct answers are probably given by better programmers and they abhor repitition.
AlphaGo vs Older Models and Weird Results
Briefly on the apples to oranges comparison between AlphaCode and older models being unfair, I’d probably say that the things which were left out (value conditioning, tags and ratings) didn’t provide more than a ~10% improvement in CodeContests. Taking into account the much higher false positive rate on APPS, which Codex achieved a roughly 3% success rate on, and the previous improvements I feel like a a fairer comparison between AlphaCode and Codex wouldn’t change the picture a crazy amount. Maybe it would be thrice as capable, instead of twice as we saw by looking at APPS. So whilst I think AlphaCode is impressive, it doesn’t present anything game-changing like AlphaGo did. Onto other parts of the results.
A neat part of this paper is when they investigate how screwing up the problem description impacts the solutions and things go as you’d expect. But something interesting is that re-ordering chunks of the description in a way that shouldn’t change the content of the question doesn’t do much (within 10% change) to the performance for any model size. Given how high the variance is I expect that this could just be noise. So this implies the model has a pretty global understanding of the problem when it is creating code, and is looking for certain content in order to make the code for I/O, for the algorithm, for data structures etc. Pretty sensible.
Maybe I’m going crazy due to a lack of sleep, but they seem to interpret one of their results in the exact opposite way I do. Look at the graph below.
Do you see that varying the number of solutions to a problem is more useful for a given sample budget than varying the number of problems you have? The other weird results in this paper include a couple of regions where performance falls with model size. I have no idea what to think of these, so I’m going to close my eyes and chant “high variance” to myself.
All their other results make sense: numbers go up, transfomer goes brrr. Stuff like how performance on how many samples pass example tests as you increase model size? It goes up. Increasing solve rate as you increase compute? It goes up. More seriously, solve rates scale log linearly with more samples, with more compute and larger models have higher slopes in the scaling curves when you use optimal model sizes according to Kaplan et al (2020).
Though sometimes the smaller models sometimes outperform the larger models. Though this may be due to high variance in the outputs and the small number of problems for some of the higher difficulty questions. Regardless, increased difficulty always hurts.
However, if you simplify the problem it tends to drastically increase model performance. Telling it the required algorithm greatly improves performance, so it leaves you wondering if it just has little shards of knowledge lying around, connected to only a few contextual clues or phrasings. Otherwise, it is unable to use them and hasn’t really grokked that they can apply everywhere.
Another in interesting thing is that AlphaCode fails badly on a great deal of problems, and increasing compute doesn’t do much to help (though the 41B model’s training was not finished). Observe this spiky graph
We can see that most problems in the validation set have essentially no solutions passing the example tests. That is, they fail at the filter step, not even on the test cases of the problem. And even for the questions where they have a visible fraction of answers passing the filter set, they still seem quite rare. Plus, this distribution looks like a power law and I am vaguely suspicious that this may be the result of some kind of random effect. You know the sort: stuff like probability of fixation of a gene, or the distributions of connections in an Erdos-Rayni graph. I’ll have to think about this a bit more to decide, but I’ll guess there is a maybe 10% chance this is true.
But hey, maybe enough shallow understanding across a wide enough domain just is general intelligence as Kaplan and Hanson seem to argue. And this is still some impressive shallow understanding. Here For more on this topic, see section 5 of Nostalgebraist’s post on language models dissapointing you.
The final important thing in their paper is probably this:
The author’s seem to think that since the model produces so many samples (1024 in this case) it can re-allocate probabilities away from atypical solutions and hence lose some marginal bits for more probability on typical, sensible solutions. This leads to a better solve rate, at least for a while. What I don’t get is why the validation loss keeps increasing past the point where the solve rate seems confident. I thought neural networks didn’t really overfit. What happened to my deep double descent?
Alignment
Now for the most important part. What’s worrying about this paper? Well, it dramatically overcame the prior SOTA, using its smallest model size and without value conditioning or tags/ratings when comparing it to Codex and GPT-Neo/J. It leaves some potential improvements on the table e.g. fully training bigger models, pre-train on more data, attatch the AI to a database to query facts about algorithms and and see which seem relevant for the program and so on. Its performance seemed a bit suprising to quite a few of us in the community. So that is evidence for shorter timelines and a lot of people should probably update on that.
On the other hand, it seems like it took a lot of different innovations and tricks that feel like they might not generalise that well (i.e. the filtering and the sampling add little with more compute). Further, there are still a bunch of false positives for the harder questions, and this whole thing feels like more evidence for the Paul-verse than the Eliezer verse.
Even so, I am surprised at how fast we got here and am quite troubled by this. So I guess I feel like we’re another step closer to doom than I thought. Thankfully, my updates are shallow and don’t reach all of my brain so I’ll be able to sleep soundly tonight.[19]
PostScript
I wrote, editted and published this article as I was reading the paper. My reason for doing this is that I have a tendency to write and stop halfway, with dozens of aborted drafts haunting my desktop. Publishing things piecemeal felt easier, and seeing people read and reply to me compelled me to keep writing. So this was good for me in the sense that I tried to do something useful. But was it good for the community? Should this even be allowed? Please vote on whether you think a moderator would say this kind of behaviour is acceptable:
- ^
I made three questions just so I don’t mess up your predictions through the act of asking.
- ^
Given there were ten contests, this is just the average. If we limit the ratings to users who have participated in at least 1 contest in the last 6 months, AlphaCode is ranked (predict it first!) in the top 28%. The gap is due to top competitors competing more regularly, and so the pool of programmers in the latter case is more homogenous, resulting in a better Elo than average ranking.
- ^
Solutions were evaluated for behaviour on test cases, edge cases as well as execution speed.
- ^
Which will likely increase capabilities. Thanks Deepmind!
- ^
They added metadata to questions, like tags for which approach should be used (“dp” or “greedy). These weren’t present in the test sets.
- ^
Editted this thanks to Daniel Kotajenko for pointing out I was unclear.
- ^
The digit after the @ is how many samples the model generates, and the 10 refers to the number of samples after the filter. This is not a green friendly model, though Nvidia may beg to differ.
- ^
As an aside, this graph feels more Paul-verse than Eliezer-verse, and maybe more Hanson verse than Paul-verse?
- ^
Pang and He, 2020, “TEXT GENERATION BY LEARNING FROM DEMONSTRA
TIONS”
- ^
Dabre and Fujita, 2021, “Investigating Softmax Tempering for Training Neural Machine Translation Models”
- ^
I love how metal that sentence sounds.
- ^
Shazeer, 2019, “Fast Transformer Decoding: One Write-Head is All you Need”
- ^
What’s going on with Deepmind? Are they running out of money to run ML experiments? Is Google strangling their funding? Was this team made up of social pariahs because they’re increasing the AI doom-ometer? Are we getting off the ever climbing compute rocket?
- ^
Kaplan et al, 2020, “https://arxiv.org/abs/2001.08361″
- ^
Only for human labelled examples, which are called “gold” labels in self supervised learning.
- ^
Naturally during inference they needed to tell the model to condition on the sample being correct.
- ^
Huszar, 2015, “How (not) to train your model: Scheduled Sampling, Likelihood, Adversary?”
Notably, they claim that MLE + adversarial training leads you towards their training objective
- ^
- ^
Wait. Uh oh.
- ^
The paper analysed the problem “Gregor and Cryptography”, they found the same thing. Using “number theory” improve correctness rates by a factor of three and achieved minimal complexity a factor of four times more often than using the “brute force” as the tag.
Cool to see this acknowledged in the paper!
I guess I would have liked more discussion on how this impacted Deepmind’s views on the topic. Could we get a discussion started on that in the comment section? I’ve added some thoughts to the article on the topic, but I haven’t sat down and done explicit calculations for how I should update yet, or how this ties into stuff like Johnwentworth’s research agenda, or how much more likely we are to be in Paul-verse or even if Paul and Eliezer would agree that this result favours Paul-verse.
In which way does this news “favour Paul-verse”?
Tons of small improvements can make big changes to performance, in a way that I did not think they would be able to. Or at least, not so soon. If there was just one change they made and it resulted in the same performance increase, I’d say that is way more Eliezer verse.
Huh. Doesn’t this also push towards faster progress and thus faster takeoff though, and thus Eliezer-verse?
Paul seems down with progress being potentially quite fast. I’d be willing to bet he thinks a takeoff in 15 years is on the cusp of plausibility. I agree that the speed is maybe more Eliezer verse, but how it happened is more Paulverse. So I’d guess l(AlphaCode|Paulverse)>l(AlphaCode|Eliezerverse). But I don’t understand Eliezerverse enough to be confident about the numerical value of the likelihoods.
Idk. At this point I want to taboo “Paul-verse” and “Eliezer-verse” and use more specific, descriptive terms instead.
My reading of Appendix A is that the group did its own judging, i.e., did not submit answers to Codeforces.
They generated lots of human verified test data, but then human implementors would do something similar.
They trained on Github code, plus solutions code on Codeforces. Did they train on Codeforces solutions code that solved any of the problems? Without delving much deeper into the work, I cannot say. They do call out the fact that the solutions did not include chunks of copy-pasted code.
To what extent are the successes presented representative of the problems tried? That is, did they try to solve lots of problems and we are seeing the cases that worked well? The fact that they were able to get solutions to some problems was impressive.
The solved problems had short solutions. How well does the technique scale to problems requiring more code for their solution? I suspect it doesn’t, but then there are applications where the solutions are often short.
It’s worth noting that Table 7 shows Github pre-training outperforming MassiveText (natural language corpus) pre-training. The AlphaCode dataset is 715GB compared to the 10TB of MassiveText (which includes 3TB of Github). I have not read the full details of both cleaning processes, but I assume that the cleaning / de-duplication process is more thorough in the case of the AlphaCode Github only dataset. EDIT: see also Algon’s comment on this below.
I know of a few EAs who thought that natural language pre-training will continue to provide relevant performance increases for coding as training scales up over the next few years, and I see this as strong evidence against that claim. One remaining question might be whether this finding is an artefact of code dataset growth temporarily accelerating relative to NL dataset size growth. Insofar as dataset sizes are constrained by compute rather than absolute data availability, I think we should expect code dataset sizes to approach NL dataset sizes. Also cf. my recent Metaculus question on future prospects for NL-pretraining to programming transfer.
I think that was largely settled by the earlier work on transfer scaling laws and Bayesian hierarchical interpretations: pretraining provides an informative prior which increases sample-efficiency in a related task, providing in essence a fixed n sample gain. But enough data washes out the prior, whether informative or uniform. So if you have enough data and compute (which you usually don’t), transfer results in the same final performance. This is also true in, say, image classification. Stuff like CLIP is great for transfer learning—unless you have millions of labeled images in your target domain, in which case, yeah sure it’ll probably just match the from-scratch model. (How else could things possibly work?) 715GB of Github is definitely large enough that it washes out the natural language prior! But as the Copilot paper also points out, you still get a big benefit in that you can train a lot less when you start with GPT-3 as the prior. Nothing to sneeze at, and I suspect DM would’ve gotten even better results here if they had started with Gopher rather than their from-scratch dataset as their compute budget would stretch much further and they could do much less brute-force rejection sampling of program candidates.
It’s also direction-specific. Natural language may wash out when you target Github… but does Github wash out when you target natural language? I will be curious to see if anyone tries using the coding models as the prior for language modeling instead of vice-versa, and if that leads to noticeable gains on the various reasoning-esque benchmarks.
But it does not show that pretraining on natural language plus Github is worse than Github-only. This is also what you’d expect from Copilot showing that GPT-3 (natural language) initialization trains much faster to the same performance level.
I agree that the scaling laws for transfer paper already strongly suggested that pre-training would eventually not provide much in terms of performance gain. I remember doing a back-of-the-envelope for whether 2025 would still use pre-training (and finding it wouldn’t improve performance), but I certainly didn’t expect us to reach this point in early 2022. I also had some small, but significant uncertainty regarding how well the scaling laws result would hold up when switching dataset+model+modelsize, and so the AlphaCode data point is useful in that regard as well.
As for the point on accelerating training, this makes intuitive sense to me, but it’s not clear to me how relevant this is? Figure 7 of Laws for Transfer shows that the compute needed to plateau on their largest models with and without pre-training looks to be within an OOM?
An OOM is nothing to sneeze at, especially when you can get it for free by training an off-the-shelf pretrained model (DM already trained a Gopher, it doesn’t cost any more to reuse!) exactly as you would otherwise, no compromises or deadends like MoEs. Note that AlphaCode didn’t have the compute budget to do its approach optimally.
Why didn’t they use Gopher then for AlphaCode? Maybe Gopher wasn’t done training yet?
Possibly but the timelines don’t quite seem to line up. On Twitter, DMers are describing this as a 2-year project, implying AlphaCode started ~February 2020. GPT-3 wouldn’t come out until May 2020 and obviously Codex/Copilot didn’t come out until mid-2021, but there were already Transformer for code generation (even assistance ones like TabNine) and so this is pretty much the obvious way to keep going and ‘2 years’ is entirely plausible as the timespan. Now, Gopher is described as starting (or was it finishing?) training in December 2020, so it became available about half-way through: they had all of 2021 & January 2022 to drop Gopher into their training & evaluation framework. I know there’s always a lot of inertia and everything always takes longer than outsiders predict on projects this complicated (look at the acknowledgements and staff section)… but I think that’s probably enough time that they could have used Gopher if they had really wanted to, unless this project was very frontloaded and mostly done by the time Gopher came around and they spent most of 2021 doing stuff like writing it up or evaluating it?
It seems equally plausible to me that they ran out of their allotted compute by the time Gopher came around and that even if it would be on net more efficient to train a Gopher, they had already spent their quota. DM doesn’t have an infinite budget and can’t pursue everything to the logical endpoint (like how AlphaStar was ignominously dropped right around where it had added harder APM limits / was using the camera like a human / training on all races+maps, but was only human-pro-level and hadn’t AlphaGo’d humans yet).
Yes, I agree certainly at 2025 training run prices, saving 2-5x on a compute run will be done whenever possible. For this reason, I’d like to see more predictions on my Metaculus question!
They chuck out larger files from Github (1MB+), or with lines longer than 1000 characters to exclude automatically generated code. I’m guessing the former is because programs that’s too long just aren’t useful when your model’s context windows are tiny in comparison. They did also get rid of duplicates. Plus, they needed to avoid code published after the questions in the dataset were made, to avoid leaking the answer.
As to natural language training, I suppose I’d agree that it is some evidence against the claim. But I’d say it would be strong evidence if they also trained Alphafold on MassiveText and found little to no performance increase. I wouldn’t be surprised if it didn’t do much though.
Edit: Oh, I just saw Table 7. Yeah, that’s pretty strong evidence against the claim that natural language corpuses are anywhere near as useful as code corpuses for this kind of stuff. But I am a little surpised that it added as much as it did.
Thanks, I’m grateful to you for reading the paper more closely than I did and reporting back interesting findings!
Can you elaborate on this? Was the prior state of the art Codex? It sounds like you are saying Codex etc. solved 1-5% of various coding question datasets, and AlphaCode 41B solves 34.2% of their new coding question dataset, which is a huge leap forward if the datasets are comparably difficult (and in fact arguably the new dataset is more difficult?). Is this what you are saying? Where did you see that in the paper? I only skimmed it but the best I found was Figure 10 which didn’t quite contain the info I wanted.
See the second paragraph from the bottom of the paper. What they’re saying is that their model solves 34.2% of questions on the dataset they built for the model. Old models couldn’t have been tested on this, as they predate it. Instead, they were tested on tje APPS benchmark (amongst others). If you want to compare it to old models in a fair way, you need to test it on APPS. Which they did, though owing to dataset differences they couldn’t use certain parts of their methods. That’s what you see in Table 10.
I’ll try to cover this in the post, but that’s probably going to be a couple of hours.
Should this other post be a separate linkpost for this? https://www.furidamu.org/blog/2022/02/02/competitive-programming-with-alphacode/#fnref:2
Feels like it covers the same, but is a personal description by an author, rather than the deepmind presser.
Thanks for mentioning that. I didn’t know it existed. But I’m not sure what you mean by “a seperate linkpost”? Anyway, I’ll link to it at the beginning of the post.
Oh great, thanks. I think I was just asking folks if they thought it should be discussed separately (since it is a different piece) or together with this one (since they’re describing the same research).
I don’t think it is worth spinning up a new article on it, but I’m a karma-junkie so my opinion is a little biased. More seriously, he doesn’t say much that the blog post doesn’t, or that you can’t see by going on the link to examples of AlphaCode’s code. But it is probably better presented.
How does this affect timelines? Does this make the prospect of AGI a lot nearer? I’m sorry, I’m just a lay person, but this has got me more scared than anything else. So now AI can finally efficiently build itself?
It shortens timelines if you were surprised by how impressive this was, and how few impressive insights seem required to give big performance increases. But the how big an impact this has depends on your priors.
To me, this thing isn’t that much more impressive than Codex in the kinds of problems it solved. It can perform some simple algorithms if you give it enough tries to a simple problem, but that is a long way away from being able to modify the kind of code it is made up of. Or even just producing anywhere near peak human insights into algorithms.
Now you might wonder if that is just a result of the model being too small, the dataset too tiny or whatever. But I don’t think so. The changes they introduced are not going to make efficient use of compute and it seems like it is probably at the point where it won’t improve with much more data. Doubling all the models resources and size is not going to get you double, or probably even a 50% increase in results. So I’m not that much more worried i.e. my timelines didn’t just get cut in half by this.
Nice. In what measure would you say it affected your timelines, like a rough approximation? And what would your timelines be at this point? I know this is hard to tell so no offense if you don’t wanna answer, but I’m indeed curious.
I don’t really know. I guess I’d say I feel like it might have shortened timelines by a few months to a year, but I think I had longer time horizons than most people on Lesswrong. I know that’s not that helpful, but I don’t think I have an explicit date of when I think catastrophe will arrive.
What do you mean by “give it enough tries”?
That is the sampling part of AlphaGo. When you ask it a question, if you let take more and more samples as potential solutions, the more likely you are to get a correct answer. The success rate of this seems to grow log linearly with the number of samples, which is pretty poor.
Oh, I see, I thought you meant AlphaCode had some sort of recurrency (does it?), i.e. iterating over its answer like a human would when debugging code. Log-linear doesn’t sound bad, you wouldn’t expect a human to be able to write a correct program on the first try without debugging either, would you?
It is pretty bad because the constant matters here. If you have a method which achieves a success rate % of C log( 1+ k) where k is the number of samples then, if C =1, you’d need 2^100~10^30 many samples per inference to get all the competitive programing questions right. As it stands, C seems pretty small. For k=100000 you get solve maybe 33% of problems. And it isn’t much better than k=1000.
That’s leaving aside the question of whether you will continue to get log-linear improvements, which I find fairly unlikely. Though if you could get C to something like 10 or 20 with one or two innovations, I’d be scared or even terrified.
To check my understanding:
If C was 20, then 100,000 samples would be enough to solve approximately 100% of these problems?
But instead C is something like 6-7, because 100,000 samples gets you to 33%?
Yeah, basically, but I was using log_2(x) instead of log_10(x) in my made up example. Here’s some actual data. What they mean by “unlimited attempts” is that there is no filter on the sample solutions, so they are all submitted to codebase. I expect false positives/incredibly slow to be more likely than performance actually increasing without limit.
Did you mix up AlphaFold and AlphaCode?
Yes, that happened a couple of times as I was writing things. Thanks.
I didn’t get how is this comparison supposed to be meaningful. The main constraint for human programmers in code contests is the time limit. I mean, the last challenge on Codeforces gives you 2 hours and 30 minutes to solve 6 problems, which is not exactly a ton of time. You need to be a very good programmer to completely solve all the problems within the deadline, but any decent programmer should be able to solve the same problems within a week. I would claim that having more time to write dramatically improves your chance of producing correct code, in a way that having more time to make a chess move does not.
We can argue that they try to model this with the limit of 10 submissions per problem, but following the link I read this:
It seems that the analogue of “give it plenty of time to write” makes the rank shift from top 54% to top 49,7%. Which is… not incredibly impressive? Did I forgot to read some details that would make the comparison more meaningful?
The 10 submissions per problem is meaningful, I think. I can see myself making 10 submissions to check my code against test cases. But I agree that the peak performance can require a lot more submissions, and make it incomparable to humans in this context. But you know, if you eliminate the time limit for some of these problems, I can see some people never getting the solution. So I don’t think this is too different. But I do expect the AI to perform worse in the no time limit case.