I don’t think you can explain away SWE-bench performance with any of these explanations
I’m not questioning whether o3 is a big advance over previous models—it obviously is! I was trying to address some suggestions / vibe in the air (example) that o3 is strong evidence that the singularity is nigh, not just that there is rapid ongoing AI progress. In that context, I haven’t seen people bringing up SWE-bench as much as those other three that I mentioned, although it’s possible I missed it. Mostly I see people bringing up SWE-bench in the context of software jobs.
I was figuring that the SWE-bench tasks don’t seem particularly hard, intuitively. E.g. 90% of SWE-bench verified problems are “estimated to take less than an hour for an experienced software engineer to complete”. And a lot more people have the chops to become an “experienced software engineer” than to become able to solve FrontierMath problems or get in the top 200 in the world on Codeforces. So the latter sound extra impressive, and that’s what I was responding to.
I mean, fair, but when did a benchmark designed to test REAL software engineering issues that take less than an hour suddenly stop seeming “particularly hard” for a computer?
Feels like we’re being frogboiled.
I would say that, barring strong evidence to the contrary, this should be assumed to be memorization.
I think that’s useful! LLMs obviously encode a ton of useful algorithms and can chain them together reasonably well.
But I’ve tried to get those bastards to do something slightly weird and they just totally self-destruct.
But let’s just drill down to demonstrable reality: if past SWE benchmarks were correct, these things should be able to do incredible amounts of work more or less autonomously, and yet all the LLM SWE replacements we’ve seen have stuck to highly simple, well-documented tasks that don’t vary all that much. The benchmarks here have been meaningless from the start, and without evidence we should assume increments on them are equally meaningless.
The lying liar company run by liars that lie all the time probably lied here, and we keep falling for it like Wile E. Coyote.
It’s been pretty clear to me, as someone who regularly creates side projects with AI, that the models are actually getting better at coding.
Also, it’s clearly not pure memorization: you can deliberately give them tasks that have never been done before and they do well.
However, even with agentic workflows, RAG, etc., all existing models seem to fail at some moderate level of complexity—they can create functions and prototypes but have trouble keeping track of a large project.
My uninformed guess is that o3 actually pushes that complexity ceiling up by some non-trivial amount, but not enough to take on complex projects.
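For anyone who hasn’t seen the “agentic workflow / RAG” setups mentioned above, here’s a deliberately minimal sketch of the retrieval half of the idea. Everything in it (the keyword-overlap scoring, the file layout, the prompt format) is invented for illustration; real systems use embeddings, tool calls, and iterative editing. The point it illustrates is that only the handful of files judged relevant ever reaches the model, which is part of why keeping track of a large project is hard.

```python
# Toy sketch of RAG-for-coding: pick the project files that look most relevant
# to a task and pack them into one prompt. Purely illustrative -- real agentic
# setups use embeddings and tool calls, not keyword overlap.

from pathlib import Path


def tokenize(text: str) -> set[str]:
    """Crude tokenizer: lowercase, alphanumeric, whitespace-separated tokens."""
    return {t.lower() for t in text.split() if t.isalnum()}


def retrieve(task: str, repo_root: str, k: int = 3) -> list[Path]:
    """Return the k .py files whose contents overlap most with the task text."""
    task_tokens = tokenize(task)
    scored = []
    for path in Path(repo_root).rglob("*.py"):
        overlap = len(task_tokens & tokenize(path.read_text(errors="ignore")))
        scored.append((overlap, path))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored[:k]]


def build_prompt(task: str, files: list[Path]) -> str:
    """Concatenate the retrieved files plus the task into a single prompt."""
    context = "\n\n".join(f"# {p}\n{p.read_text(errors='ignore')}" for p in files)
    return f"{context}\n\nTask: {task}\n"


if __name__ == "__main__":
    task = "fix the retry logic in the HTTP client"  # hypothetical task
    print(build_prompt(task, retrieve(task, ".")))
```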
Thanks for the reply! Still trying to learn how to disagree properly so let me know if I cross into being nasty at all:
I’m sure they’ve gotten better. o1 probably improved more from its heavier use of intermediate logic, compute/runtime, and such, but that said, at least up till 4o it looks like there have been improvements in the models themselves; they’ve been getting better.
They can do incredible stuff in well-documented processes but don’t survive well off the trodden path. They seem to string things together pretty well, so I don’t know if I would say there’s nothing else going on besides memorization, but it seems to be a lot of what’s happening: it’s working with building blocks of memorized stuff and learning to stack them using the same sort of logic it uses to chain natural language. It fails exactly in the ways you’d expect if that were true, and it has done well in coding exactly as if that were true. The fact that the SWE benchmark is giving fantastic scores despite my criticism and yours means those benchmarks are missing a lot and probably aren’t measuring the shortfalls these models have historically had.
See below: GPT-4 was scoring pretty well on code exercises like Codeforces that are toolbox-oriented, and did super well on more complex problems on LeetCode… until the problems were outside of its training data, in which case it dropped from near-perfect to barely being able to do anything.
https://x.com/cHHillee/status/1635790330854526981?t=tGRu60RHl6SaDmnQcfi1eQ&s=19
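As I read the linked thread, the check is simple: split problems by whether they appeared before or after the model’s training cutoff and compare solve rates. Here’s a toy version of that comparison; the cutoff date, problem records, and results are all made up for illustration, not the tweet’s actual numbers.

```python
# Contamination-check sketch: compare solve rates on problems published before
# vs. after the training cutoff. All records and the cutoff date are placeholders.

from datetime import date

TRAINING_CUTOFF = date(2021, 9, 1)  # assumed cutoff, for illustration only

# Hypothetical records: (problem_id, publication_date, solved_by_model)
results = [
    ("A1", date(2021, 3, 1), True),
    ("B2", date(2021, 6, 9), True),
    ("C3", date(2022, 1, 15), False),
    ("D4", date(2022, 5, 2), False),
]


def solve_rate(records: list[tuple[str, date, bool]]) -> float:
    return sum(solved for _, _, solved in records) / max(len(records), 1)


before = [r for r in results if r[1] < TRAINING_CUTOFF]
after = [r for r in results if r[1] >= TRAINING_CUTOFF]

print(f"solve rate before cutoff: {solve_rate(before):.0%}")
print(f"solve rate after cutoff:  {solve_rate(after):.0%}")
# A big gap between the two numbers is the memorization signature described above.
```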
This was GPT-4, but I don’t think o1 is much different; it looks like they update more frequently, so this is harder to spot in major benchmarks, but I still see it constantly.
Even if I stop seeing it myself, I’m going to assume the problem is still there and just getting better at hiding, unless there’s a revolutionary change in how these models work. Catching lies up to this point seems to have selected for better lies.