Thanks for the reply! Still trying to learn how to disagree properly so let me know if I cross into being nasty at all:
I’m sure they’ve gotten better. o1’s gains probably came more from its heavier use of intermediate reasoning, compute/runtime and such, but that said, at least up through 4o there do seem to have been improvements in the models themselves.
They can do incredible stuff in well-documented processes but don’t survive well off the trodden path. They seem to string things together pretty well, so I wouldn’t say there’s nothing going on besides memorization, but memorization seems to be a lot of it: it’s working with building blocks of memorized stuff and learning to stack them using the same sort of logic it uses to chain natural language. It fails exactly in the ways you’d expect if that were true, and it has done well in coding exactly as if that were true. The fact that SWE-bench gives fantastic scores despite my criticism and yours means those benchmarks are missing a lot and probably aren’t measuring the shortfalls, just as benchmarks historically haven’t.
See below: 4 was scoring pretty well on toolbox-oriented code exercises like Codeforces and did super well on more complex LeetCode problems… until the problems fell outside its training data, at which point it dropped from near perfect to barely being able to do anything.
https://x.com/cHHillee/status/1635790330854526981?t=tGRu60RHl6SaDmnQcfi1eQ&s=19
This was 4, but I don’t think o1 is much different. It looks like they update more frequently now, so this is harder to spot in major benchmarks, but I still see it constantly.
Even if I stop seeing it myself, I’m going to assume the problem is still there and just getting better at hiding, unless there’s a revolutionary change in how these models work. Catching lies up to this point seems to have selected for better lies.
Can you make some noise in the direction of the shockingly low numbers it gets on early ARC-AGI-2 benchmarks? That feels like pretty open-and-shut proof that it doesn’t generalize, no?
The fact that the model was trained on 75 percent of the training set feels like they jury-rigged a test set and RL’d the thing to success. If the <30% score on the second test ends up holding, that should shift our guesses about what it’s actually doing heavily away from genuine intelligence and toward brute-force search for verifiable answers.
The FrontierMath results just feel unconvincing. Chances are there are well-known problem structures with well-known solution structures, and it’s just plugging and chugging. Mathematicians who have looked at sample problems have indicated that both Tier 1 and Tier 2 problems have solutions they know by reflex, which implies these o3 results aren’t indicative of anything super interesting.
This just feels like a nothingburger, and I’m waiting for someone to convincingly tell me why my doubts are misplaced.