CoT prompting and agentic behavior are basically supplying System 2 thinking. Currently LLMs tend to use and benefit from them for a while, then sooner or later go off the rails, get caught in a loop, or get confused, and are seldom able to get unstuck when they do. What we need is for them to carry out abilities they have already demonstrated much more reliably, which is bread-and-butter work for scaling. So I don’t see System 2 thinking as a blocker, just work in progress. It might take a few years.
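To make the "goes off the rails / gets caught in a loop" failure mode concrete, here is a minimal sketch of the kind of agentic CoT loop being described. The `llm` callable and the `FINAL:` stopping convention are assumptions for illustration, not any particular vendor's API or the way any specific system works.

```python
from typing import Callable

MAX_STEPS = 10  # hard cap: a confused model can otherwise loop forever

def solve_with_cot(task: str, llm: Callable[[str], str]) -> str:
    """Append reasoning steps until the model emits 'FINAL: <answer>' or gives up."""
    transcript = f"Task: {task}\nThink step by step. End with 'FINAL: <answer>'."
    for _ in range(MAX_STEPS):
        step = llm(transcript)           # one "System 2" reasoning step
        transcript += "\n" + step
        if "FINAL:" in step:             # the model believes it is done
            return step.split("FINAL:", 1)[1].strip()
    return ""                            # went off the rails or got stuck in a loop
```

The step cap is there precisely because of the failure mode above: once the model starts repeating itself, extra steps rarely help.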
As for the ARC challenge, it clearly requires a visual LLM, so systems capable of attempting it have only really existed for about 18 months. My guess is that it will fall soon: progress on math and programming benchmarks has been rapid, so visual logic puzzles don’t seem like they would be that hard. I’d guess the main problem is the shortage of training material for visual puzzles like these in most training sets.
> My guess is that it will fall soon: progress on math and programming benchmarks has been rapid, so visual logic puzzles don’t seem like they would be that hard.
His argument is that with millions of examples of these puzzles you can train an LLM to be good at this particular task, but that doesn’t amount to reasoning if the model then fails on a similar task it hasn’t seen. He thinks you should be able to train an LLM to do this without ever training on tasks like these.
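For anyone who hasn't looked at the benchmark, here is roughly the shape of an ARC task: a few small integer grids as train input/output pairs plus a test input, and the solver is supposed to infer the transformation from the train pairs alone. The grids and the "flip horizontally" rule below are an invented toy example, not a real ARC task.

```python
# Toy illustration of the ARC task format: real tasks ship as JSON with
# "train"/"test" lists of {"input", "output"} grids of integers 0-9.
# The grids and the hidden rule here ("flip each row horizontally") are made up.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[2, 0], [0, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]], "output": [[0, 0], [0, 3]]},
    ],
}

def flip_horizontal(grid):
    """The hidden rule a solver would have to infer from the train pairs."""
    return [list(reversed(row)) for row in grid]

# A solver only sees the train pairs and the test input; this check just
# confirms the baked-in rule is consistent with every pair in the toy task.
assert all(flip_horizontal(p["input"]) == p["output"]
           for p in toy_task["train"] + toy_task["test"])
```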
I can buy this argument, but I still have some doubts. It may be that this kind of reasoning is just derived from visual training data plus reliably spending more compute per token, or he may be right and LLMs are fundamentally terrible at abstract reasoning. I think it would be useful to know the youngest age at which a human can still solve these puzzles; that might give us a sense of how much “training data” a human needs to get there.
Some caveats: I believe humans only score around 85% on the public test set. That says nothing about the difficulty of the private test set. Maybe it’s harder, though I doubt it, since that would go against what he claims is the spirit of the benchmark.