First post, so feel free to meta-critique my rigor, as I am not sure what is mandatory, expected, or superfluous for a comment under a post. I’m studying computer science but have no degree yet. I can pull the specific citation if necessary, but...
these benchmarks don’t feel genuine.
Chollet indicated in his piece:
“Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).”
The model was tuned to the ARC-AGI test and got a great score, but then faceplanted on a reasoning test that is apparently easier for humans than the first one, and that couldn’t have been adversarially designed to stump o3. I would have expected this to immediately expose the model as horrifically overfit, but few people seem to be homing in on that, so maybe I don’t understand something?
Second, the FrontierMath problems appear to be structured so that roughly 25% of them are the sort that clever high schoolers with bright futures in mathematics could answer if they studied: International Math Olympiad or undergrad-level questions. We don’t know EXACTLY which questions o3 answered correctly, but its score is almost exactly 25%, and I suspect those are almost entirely the lower-tier questions. I wouldn’t be surprised to hear that the questions (or similar ones) were in the training set; perhaps the lowest-tier questions were written with an operational philosophy that didn’t prioritize guarding against leaked data?
Third, the Codeforces Elo doesn’t mean anything to me. I just can’t take it seriously, unless someone thinks that existing models are already mid-tier competitive SWEs? The similar coding benchmark is similarly meaningless: dismissed until someone shows me that these models can actually deliver on what earlier models’ scores demonstrably exaggerated.
Fourth, the cost is obscene: thousands of dollars per task. Given indications that this model is very similar to o1 in cost per token, it looks less like a stronger model with far more parameters and more like the same model doing a mind-boggling amount of thinking. This is probably something like FunSearch: a tool bag of programs that an LLM combines and gauges the effectiveness of, brute-forced until it can get an answer it can verify. That seems useful, but it would only work on closed-ended questions with answers that are easy to verify. Either way, this wouldn’t really be intelligence of the kind that I had imagined looking for.
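To make that concrete, here is a minimal sketch of the generate-and-verify loop I’m imagining. Everything in it is a placeholder of my own invention (llm_propose, verify, the budget), not anything OpenAI has described:

```python
import random

def llm_propose(problem, temperature=1.0):
    # Placeholder: in reality this would be a sampled chain of thought
    # plus a candidate answer from the model.
    return f"candidate answer for {problem!r} at temperature {temperature:.2f}"

def verify(problem, candidate):
    # Placeholder: this step only exists for closed-ended tasks where a
    # candidate is cheap to check (unit tests, an exact grid, a number).
    return random.random() < 0.01

def solve_by_brute_search(problem, budget=10_000):
    # Keep sampling and checking until something verifies or the budget runs out.
    for _ in range(budget):
        candidate = llm_propose(problem, temperature=random.uniform(0.5, 1.5))
        if verify(problem, candidate):
            return candidate
    return None  # compute spent, no verified answer

# e.g. solve_by_brute_search("an ARC-style grid puzzle")
```

The point of the sketch is just that the intelligence, if any, lives in the proposer and the verifier; the outer loop is brute force.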
This would PERFECTLY explain the failure on the ARC-AGI-2 benchmark: the bag of tools it would need would be different, and since it wasn’t tuned to the new test it came with the wrong tool bag. Maybe that could be fixed, but if my model of how this AI works is right, then the complexity of tasks would grow by something like O(n!), with n being the number of “tools” it needs. I’m probably wrong here, but something LIKE that is probably true.
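Rough arithmetic for that worry, with toy numbers of my own choosing rather than anything measured about o3: if a task needs the right ordered combination of n tools, the naive search space is n! orderings.

```python
from math import factorial

# Toy illustration of factorial growth in the number of tool orderings.
for n in (3, 5, 8, 12):
    print(n, factorial(n))  # 6, 120, 40320, 479001600
```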
LeCun also seems confident, on Threads, that this is NOT an LLM: that it is something which uses an LLM, but that something else is going on. This perfectly matched my “oh, this is FunSearch” intuition. My caveat is that this might all be handled “in house” in an LLM, but the restrictions on what it could do seem very real.
Am I critically wrong on enough points here that I should seriously rethink my intuition?
Welcome!
To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can’t naively translate benchmark scores to real-world capabilities.
Thank you for the warm reply. It’s nice, and it’s also good feedback that I didn’t do anything explicitly wrong with my post.
It will be VERY funny if this ends up being essentially the o1 model with some tinkering to help it cycle questions multiple times to verify the best answers, or something banal like that. Wish they didn’t make us wait so long to test that :/
Well, the update for me would go both ways.
On one side, as you point out, it would mean that the model’s single pass reasoning did not improve much (or at all).
On the other side, it would also mean that you can get large performance and reliability gains (on specific benchmarks) by just adding simple stuff. This is significant because you can do this much more quickly than the time it takes to train a new base model, and there’s probably more to be gained in that direction – similar tricks we can add by hardcoding various “system-2 loops” into the AI’s chain of thought and thinking process.
You might reply that this only works if the benchmark in question has easily verifiable answers. But I don’t think it is limited to those situations. If the model itself (or some subroutine in it) has some truth-tracking intuition about which of its answer attempts are better/worse, then running it through multiple passes and trying to pick the best ones should get you better performance even without easy and complete verifiability (since you can also train on the model’s guesses about its own answer attempts, improving its intuition there).
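Here is a minimal sketch of what I mean by multiple passes plus picking the best, without a hard verifier. llm_propose and llm_score are hypothetical stand-ins, not a claim about how o3 actually works:

```python
def best_of_n(problem, llm_propose, llm_score, n=16):
    # llm_propose samples one answer attempt; llm_score returns the model's
    # own guess at how good that attempt is (an intuition, not a proof).
    attempts = [llm_propose(problem) for _ in range(n)]
    scored = [(llm_score(problem, attempt), attempt) for attempt in attempts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # The (score, attempt) pairs could also be kept as training data to
    # sharpen the scorer itself, which is the self-improvement idea above.
    return scored[0][1]
```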
Besides, I feel like humans do something similar when we reason: we think up various ideas and answer attempts and run them by an inner critic, asking “is this answer I just gave actually correct/plausible?” or “is this the best I can do, or am I missing something?”
(I’m not super confident in all the above, though.)
Lastly, I think the cost will eventually come down by orders of magnitude (I’m confident of that). I would have to look up trends to say how quickly I expect $4,000 in runtime costs to drop to $40, but I don’t think it will take all that long. Also, if you can do extremely impactful things with some model, like automating further AI progress on training runs that cost billions, then willingness to pay for model outputs could be high anyway.
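As a back-of-envelope illustration (the decline rates here are assumptions I picked, not looked-up trends), a 100x drop from $4,000 to $40 would take:

```python
import math

# Years needed for a 100x cost drop under an assumed annual decline rate.
for annual_drop in (2, 4, 10):  # "costs fall 2x / 4x / 10x per year"
    years = math.log(100) / math.log(annual_drop)
    print(f"{annual_drop}x per year -> ~{years:.1f} years")  # ~6.6, ~3.3, 2.0
```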
I sense that my quality of communication diminishes past this point; I should get my thoughts together before speaking too confidently.
I believe you’re right that we do something similar to the LLMs (loosely, analogously); see
https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble
(I need to learn markdown)
My intuition is still LLM-pessimistic. I’d be excited to see good practical uses; this seems like tool AI, and that makes my existential dread easier to manage!