Can you make some noise about the shockingly low numbers it gets on the early ARC-2 benchmarks? That feels like pretty open-and-shut proof that it doesn’t generalize, no?
The fact that the model was trained on 75 percent of the training set feels like they jury-rigged a test set and RL’d the thing to success. If the <30% score on the second test ends up being true, I feel like that should shift our guesses about what it’s actually doing heavily away from genuine intelligence and toward a brute-force search for verifiable answers.
The FrontierMath results just feel unconvincing. Chances are there are well-known problem structures with well-known solution structures, and it’s just plugging and chugging. Mathematicians who have looked at some sample problems have indicated that both tier 1 and tier 2 problems have solutions they know by reflex, which implies these o3 results are not indicative of anything super interesting.
This just feels like a nothingburger, and I’m waiting for someone to tell me, convincingly, why my doubts are misplaced.