The approach of just doing generative self-supervised learning on existing source code corpora is picking the low-hanging fruit. As impressive as it is to see Codex just knock out a web scraper, coding is very much a colonization wave sort of place: standalone code is fine, but the bulk of the work has always been maintenance and debugging of existing systems, not spitting out little self-contained scripts. Because of this asymmetry, Codex is a meaningful step towards UFAI, but a smaller step towards automating programmers.
To elaborate on this a little more: maintenance is the kind of nasty field where ‘99% accurate’ may still not be nearly good enough if you want to unlock big productivity gains of the sort you get by replacing humans entirely, rather than merely saving a few minutes here or there looking up API docs etc. Amdahl’s law is not mocked: if a human has to manually review and step in, then it cannot deliver more than modest constant-factor gains, any more than learning to type really fast will deliver life-changing productivity gains. Maintenance is almost by definition about the long tail of subtle bugs, system interactions, faulty assumptions, and business-driven requirement changes.* If you’re a SWE at Google, you don’t spend very much time writing little self-contained greenfield scripts of 100-500 lines. You’ll spend a lot more time doing, say, code reviews of new pull requests, which involve no writing of greenfield code at all. Something like Codex can help knock out the occasional script or help in learning a new system or be a very useful substrate for static analysis tools (like Coverity on steroids), but I can confidently predict that Codex is not going to make programmers even 10x more productive. Utility doesn’t increase smoothly with accuracy: it plateaus and jumps. You don’t want to use a voice transcription system which makes 10% errors, but at 5% it might suddenly become useful.
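The Amdahl's law point can be made concrete with a little arithmetic. This is only a sketch: the fractions and the 100x figure below are made-up illustrative inputs, not measurements of any real programmer's workload, and `amdahl_speedup` and `max_speedup` are names invented for this example.

```python
def amdahl_speedup(automated_fraction: float, automation_speedup: float) -> float:
    """Overall speedup when only part of the work gets faster (Amdahl's law)."""
    if not 0.0 <= automated_fraction <= 1.0:
        raise ValueError("automated_fraction must be in [0, 1]")
    if automation_speedup <= 0:
        raise ValueError("automation_speedup must be positive")
    # The part a human still does by hand (review, stepping in) is not sped up.
    remaining = 1.0 - automated_fraction
    return 1.0 / (remaining + automated_fraction / automation_speedup)


def max_speedup(automated_fraction: float) -> float:
    """Ceiling on the overall speedup, even if the automated part took zero time."""
    remaining = 1.0 - automated_fraction
    return float("inf") if remaining == 0 else 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical shares of a job that a code model could take over entirely.
    for fraction in (0.2, 0.5, 0.9):
        print(
            f"automated={fraction:.0%}  "
            f"speedup at 100x={amdahl_speedup(fraction, 100.0):.2f}  "
            f"ceiling={max_speedup(fraction):.2f}"
        )
```

Under these assumptions, even handing 90% of the work to a tool that is arbitrarily fast leaves the overall gain capped at about 10x, because the remaining human-reviewed tenth dominates.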
But ironically, in many ways, developing DL code is far simpler. Sometimes, solving a much harder problem is much easier. DL is much more self-contained and amenable to self-modification. The complexity of the learned tasks resides in the weights, not the seed algorithm which learns the NN; the seed algorithm may be extremely simple and short, a few hundred lines at most, including all the boilerplate and wrappers and ceremony. You can write backprop and CNNs in a few hundred lines for a self-contained CPU implementation. Available DL libraries let you create & train an arch like GPT in a few dozen lines (Karpathy does minGPT in <300 lines of bloated code). Rip Van Winkle is an interesting exercise in estimating the complexity, in a Kolmogorov sort of way, of a formerly-SOTA CNN ResNet at 1,032 bits. Evolutionary search programs like AutoML-Zero can recapitulate backprop and other core algorithms in a few lines. We also see this in the breakthroughs themselves: Why do MLPs suddenly work? Because you add like 1 line to re-normalize or gate intermediate activations, while 99.9% of the code remains the same. Why did resnets suddenly make ‘deep’ (>10) layer NNs work? Because you add like 1-3 lines to define a shortcut connection. Why did NNs suddenly start working around 2009? Because you added 1 line for the right initialization, and 1 line for a ReLU instead of sigmoid nonlinearity. Why did X work? We could go on all day. (Why is one person a genius and another person ordinary? Differences at a few thousand alleles which could be encoded in less than a kilobyte. Why did humans take over the world while chimpanzees are in zoos, if our genomes are roughly 99% identical? Everything is fragile.)
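To give a feel for how small such an edit is, here is a toy forward pass in plain Python where the shortcut connection is literally one line behind a flag. It is a minimal sketch, not how any real resnet is implemented: the width, depth, weight range, and the names `layer` and `forward` are all assumptions chosen for illustration, and it only shows signal surviving a deep stack, not the training behaviour.

```python
import random


def relu(x: float) -> float:
    return max(0.0, x)


def layer(vector, weights, activation):
    """One dense layer without bias: activation applied to each row dot vector."""
    return [activation(sum(w * v for w, v in zip(row, vector))) for row in weights]


def forward(vector, all_weights, activation, shortcut: bool):
    """Run the input through every layer, optionally adding a shortcut connection."""
    for weights in all_weights:
        out = layer(vector, weights, activation)
        if shortcut:
            out = [o + v for o, v in zip(out, vector)]  # the one-line change
        vector = out
    return vector


# Hypothetical toy setup: 30 layers of width 4 with small random weights.
rng = random.Random(0)
WIDTH, DEPTH = 4, 30
all_weights = [
    [[rng.uniform(-0.1, 0.1) for _ in range(WIDTH)] for _ in range(WIDTH)]
    for _ in range(DEPTH)
]
x = [1.0] * WIDTH

plain = forward(x, all_weights, relu, shortcut=False)
residual = forward(x, all_weights, relu, shortcut=True)
print("plain   :", plain)
print("residual:", residual)
```

With weights this small, each plain layer shrinks the signal by at least a factor of 0.4, so after 30 layers essentially nothing of the input is left, while the shortcut version can only add to a non-negative input and so carries it through intact. Everything else in the program is unchanged between the two runs.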
The space of all possible programs of a few hundred self-contained lines to bootstrap a general meta-learning agent is vast… but it’s also exactly the sort of task where a self-supervised agent can acquire most of the necessary bits from the environment, solving basic problems like how to create valid ASTs (the sort of knowledge that isn’t in AutoML-Zero-esque systems, and mostly accounts for their boil-the-ocean inefficiency), and then use the tiny bit of supervision from evolutionary RL losses to close the gap by selecting only plausible modifications to test, running a feasible number of iterations, and modifying the last handful of key lines.
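The filter-then-score step described above can be sketched in a few lines. This is a loose illustration under heavy assumptions: the candidate list is hand-written rather than generated by a model, the search is a plain enumeration rather than an evolutionary or RL loop, and `toy_loss` is a hypothetical stand-in for actually training a network with each candidate. Only `ast.parse`, `compile`, and `eval` are real Python built-ins; every other name is made up for the example.

```python
import ast

# Hypothetical one-line edits: candidate bodies for an activation function.
CANDIDATE_BODIES = [
    "max(0.0, x)",  # ReLU
    "x",            # identity
    "x * x",
    "max(0.0, x",   # not a valid expression: rejected before any evaluation
    "abs(x)",
    "0.0",
]


def is_valid_expression(source: str) -> bool:
    """Cheap syntactic filter: does the candidate parse as a Python expression?"""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


def compile_activation(body: str):
    """Turn a candidate body into a callable, exposing only max and abs."""
    tree = ast.parse(f"lambda x: {body}", mode="eval")
    code = compile(tree, "<candidate>", "eval")
    return eval(code, {"__builtins__": {}, "max": max, "abs": abs})


def toy_loss(activation) -> float:
    """Stand-in for an expensive training run: squared error against a fixed target."""
    points = [i / 4 for i in range(-8, 9)]
    return sum((activation(x) - max(0.0, x)) ** 2 for x in points) / len(points)


def search(candidates):
    """Keep only parseable candidates, then rank the survivors by the toy loss."""
    valid = [body for body in candidates if is_valid_expression(body)]
    scored = [(toy_loss(compile_activation(body)), body) for body in valid]
    return valid, min(scored)


valid, best = search(CANDIDATE_BODIES)
print(f"{len(valid)} of {len(CANDIDATE_BODIES)} candidates parsed; best: {best[1]!r}")
```

The design point is only that the expensive evaluation is spent on candidates that already pass the cheap checks, which is the kind of prior knowledge a system searching from scratch lacks.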
Thus, an asymmetry in code-generating AIs. A code-generating AI could be almost completely useless for ‘easy’ maintenance tasks like fixing bugs in production code because it comes with so much overhead and unreliability that it isn’t worth the hassle, but also still offer enormous exponential gains in ranking candidates for the ‘hard’ problem of rewriting a core DL algorithm. It is unfortunate that we live in a world where you can apparently be 99.9% of the way to a human or an AGI and the result be completely useless, rather than 99.9% as powerful, because it means you may get no warning signs before that last 1-line fix; but that looks like the world we live in, as opposed to a gradualist world where half-working AIs take over half the global economy or something.
* If you’ve paid attention to the popups on Gwern.net, you’ve probably noticed that they’ve changed a number of times; the Wikipedia popups, specifically, have now gone through 8 completely different implementations. The 8th iteration, ironically, is very similar to the 1st iteration: it requests from the Wikipedia APIs an article summary and displays it; that’s all. I & Obormot have spent a breathtaking amount of time on this, not because the actual coding itself takes up substantial time (none of it is remotely impressive algorithmically), but because the hard part is understanding what even should be done in the first place and what tradeoff between static, dynamic, inlined vs external, popup vs popin etc works best, implementing and testing in the real world to see how it felt in practice and what users thought, how it scaled as I fixed bugs & found edge-cases… By the 8th iteration, what we’d learned was that static or inlined couldn’t work at scale or provide recursion in any feasible way and were dead ends, and the main motivation for those (displaying hyperlinked excerpts) was moot because we were using the wrong WP API in the first iteration, and there was a ‘mobile’ API which, I discovered after hours of docs reading, provided useful rather than butchered excerpts and had worked fine all along. “Time is a circle.”
Interesting reading this 3 years later. I occasionally paste a bug report directly into Cursor and, provided I am right about which file the bug is in, it often one-shots the fix. I remain confused about why RSI isn’t critical by now.
There’s a finetuned model now too: “Genji is a transformer model finetuned on EleutherAI’s GPT-J 6B model. This particular model is trained on Python-only code approaching 4GB in size.”
Thanks! Any thoughts on Codex? Do you think insane progress in code generation will continue for at least a few years?