This seems to be as good a place as any to post my unjustified predictions on this topic, the second of which I have a bet outstanding on at even odds.
1) Devin will turn out to be just a bunch of GPT-3.5/4 calls and a pile of prompts/heuristics/scaffolding so disgusting and unprincipled only a team of geniuses could have created it.
2) Someone will create an agent that gets 80%+ on SWE-Bench within six months.
I am not sure whether 1) being true or false would be the better news; either outcome suggests we should update towards large jumps in coding ability very soon.
Regarding RSI (recursive self-improvement), my intuition has always been that automating AI research will likely be easier than automating the development and maintenance of a large app like, say, Photoshop, so I don’t expect fire alarms like “non-gimmicky top 10 app on the App Store was developed entirely autonomously” before doom.
Someone will create an agent that gets 80%+ on SWE-Bench within six months.

I think this is probably above the effective cap on the current implementation of SWE-bench (where you can’t see the test cases), because the test cases are often specific to the implementation. E.g., the test cases may assume that a given method was given a particular name even though the task description doesn’t specify one.
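To make that failure mode concrete, here is a hypothetical illustration; the class, method names, and hidden test below are invented for the example and are not from any actual SWE-bench task:

```python
# Hypothetical illustration of the hidden-test problem (invented code, not a real SWE-bench task).
# Issue text the agent sees: "Cart should expose the order total including tax."

class Cart:
    def __init__(self, items, tax_rate=0.5):
        self.items = items
        self.tax_rate = tax_rate

    # A perfectly reasonable patch might add this method...
    def order_total(self):
        return sum(self.items) * (1 + self.tax_rate)


# ...but the hidden test only passes if the maintainer's chosen name was used:
def test_total_with_tax():
    cart = Cart([10.0, 20.0])
    assert cart.total_with_tax() == 45.0  # AttributeError: the agent named it order_total()
```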
I officially lost the bet and paid up. Amusingly, SWE-Bench is so broken that it was likely impossible for me to win, though I would have lost in any case.
If you make a bet about a benchmark, probably you should understand it deeply and not just bet on vibes, ha!
Not looking good for my prediction: https://www.swebench.com/
Your timeline was off, but I think your original comment will turn out to have had the right idea. Given the leaps from GPT-3.5 to GPT-4 to Devin to Claude 3.5 Sonnet with scaffolding, marginal-seeming updates to models are turning out to be quite substantial in effective capability. It’s hard to create evaluation harnesses for fuzzy, abstract things like the syntactic complexity models can handle, and those abilities do transfer to using the models to automate their own post-training tasks, e.g. what the self-alignment backtranslation paper’s scaling charts showed. The better the model, the better it accomplishes these tasks with worse task definitions and less help.

The piles of prompts necessary for current agents will become less and less necessary, at some point generated on the fly by the models themselves from terse descriptions like “make a code agent to do ${task}”. Whatever human effort goes into the next generation of unprincipled scaffolding will provide yet greater returns to future models. These factors combined, I expect SWE-Bench progress to be discontinuous and rapid, as it has been so far.
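As a rough sketch of what “generated on the fly” could look like, the model writes the code agent’s own scaffold prompt from a terse task description; complete() below is a placeholder for whatever chat-completion call one actually uses, not a real API:

```python
# Rough sketch of scaffolding generated on the fly: the model writes the code
# agent's own system prompt from a terse task description, then runs under it.
# complete() is a placeholder for a real chat-completion call, not a real API.

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def make_code_agent(task: str):
    # Step 1: ask the model to write the scaffold prompt itself.
    scaffold_prompt = complete(
        f"Write a system prompt for a coding agent whose job is: {task}. "
        "Describe the tools it may call and the output format it must use."
    )

    # Step 2: the returned agent runs every request under that generated scaffold.
    def agent(issue: str) -> str:
        return complete(f"{scaffold_prompt}\n\nIssue to solve:\n{issue}")

    return agent

# Usage (hypothetical): agent = make_code_agent("fix failing CI on repo X")
```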
A very naive extrapolation using polynomial regression on SWE-Bench scores suggests ≥80% by November 2025. I used model release dates as my x-values. It’s also important to note that the models may be contaminated.
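For concreteness, the extrapolation amounts to something like the following sketch; the scores are made-up placeholders rather than actual leaderboard numbers:

```python
# A minimal sketch of the naive extrapolation: fit a low-degree polynomial to
# (model release date, SWE-Bench score) points and find when the fitted curve
# first reaches 80%. The scores below are placeholders, not leaderboard data.
from datetime import date

import numpy as np

# (release date, SWE-Bench score in %) -- illustrative values only
points = [
    (date(2023, 3, 14), 5.0),
    (date(2023, 11, 6), 12.0),
    (date(2024, 3, 12), 20.0),
    (date(2024, 6, 20), 33.0),
]

x = np.array([d.toordinal() for d, _ in points], dtype=float)
y = np.array([score for _, score in points], dtype=float)

# Center the x-values (days since the first release) so the fit is well
# conditioned, then run a degree-2 polynomial regression on date vs. score.
x0 = x.min()
coeffs = np.polyfit(x - x0, y, deg=2)

# Walk forward one day at a time until the fitted curve crosses 80%
# (capped at ten years so the loop always terminates).
day = x.max()
while np.polyval(coeffs, day - x0) < 80.0 and day < x.max() + 10 * 365:
    day += 1

print("Fitted curve reaches 80% around", date.fromordinal(int(day)))
```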
Yeah, I expect it to fall soon but I will lose my bet if it doesn’t happen in a month.
1) …a pile of prompts/heuristics/scaffolding so disgusting and unprincipled only a team of geniuses could have created it

I chuckled out loud over this. Too real.
Also, regarding that second point, how do you plan to adjudicate the bet? It is worded as “create” here, but what can actually be observed to settle the bet will be the effects.
There are rumors coming out of Google including names like “AlphaCode” and “Goose” that suggest they might have already created such a thing, or be near to it. Also, one of the criticisms of Devin (and Devin’s likelihood of getting better fast) was that if someone really did crack the problem then they’d just keep the cow and sell the milk. Critch’s “tech company singularity” scenario comes to mind.
The bet is with a friend and I will let him judge.
I agree that providing an API to God is a completely mad strategy and we should probably expect less legibility going forward. Still, we have no shortage of ridiculously smart people acting completely mad.
I put ~5% on the part I selected, but there is no 5% emoji, so I thought I would mention it in a short comment.