This is the big takeaway here: search is a notable capabilities improvement on its own, but it still needs compute scaling to get better results.
But the other takeaway, based on its performance on several benchmarks, is that adding search turned out to be much easier than Francois Chollet thought it would be, and it’s looking like compute and data are the hard parts of getting intelligence into LLMs, not the search and algorithm parts.
This is just another point on the trajectory of LLMs becoming more and more general reasoners, rather than just memorizing their training data.
I was just amused to see a tweet from Subbarao Kambhampati in which he essentially speculates that o1 is doing search and planning in a way similar to AlphaGo... accompanied by a link to his ‘LLMs Can’t Plan’ paper.
I think we’re going to see some goalpost-shifting from a number of people in the ‘LLMs can’t reason’ camp.
I agree with this, and I think that o1 is clearly a case where a lot of people will try to shift the goalposts even as AI gets more and more capable and runs more and more of the economy.
It’s looking like the hard part isn’t the algorithmic or data parts, but the compute part of AI.
This is the first model where we have strong evidence that the LLM is actually reasoning/generalizing and not just memorizing its data.
Really? There were many examples where even GPT-3 solved simple logic problems that couldn’t be explained by having the solution memorized. The effectiveness of chain-of-thought prompting was discovered when GPT-3 was current. GPT-4 could do fairly advanced math problems, explain jokes, etc.
The o1-preview model exhibits a substantive improvement in CoT reasoning, but arguably not something fundamentally different.
True enough, and I should probably rewrite the claim.
Though what was the logic problem that was solved without memorization?
I don’t remember exactly, but there were debates (e.g. involving Gary Marcus) on whether GPT-3 was merely a stochastic parrot or not, based on various examples. The consensus here was that it wasn’t. For one, if it was all just memorization, then CoT prompting wouldn’t have provided any improvement, since CoT imitates natural language reasoning, not a memorization technique.
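To make that point concrete, here’s a minimal sketch of the difference between a direct prompt and a chain-of-thought prompt. It assumes the current OpenAI Python SDK; the model name and the example question are just placeholders, not anything from the papers or debates above.

```python
# Minimal sketch: direct prompt vs. chain-of-thought prompt.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Direct prompt: the model has to produce the answer in one shot.
direct = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": question + " Answer with just the number."}],
)

# Chain-of-thought prompt: the model is asked to reason in natural language
# before answering. If performance were pure memorization, this extra
# reasoning text shouldn't help, yet empirically it does.
cot = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": question + " Let's think step by step, then state the final answer.",
    }],
)

print("Direct:", direct.choices[0].message.content)
print("CoT:   ", cot.choices[0].message.content)
```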
Yeah, it’s looking like o1 is just quantitatively better at generalizing than GPT-3, not qualitatively better.