By “good at solving” I mean “better than the average person”.
I think the fact that language models are better than humans at predicting the next token implies that LLMs have sophisticated text-oriented cognition, and saying “LLMs are not capable of solving ARC, therefore they are less intelligent than children” is equivalent to saying “humans can’t take the square root of 819381293787, therefore they are less intelligent than a calculator”.
My guess is that we would probably need to do something non-trivial to scale LLMs to superintelligence, but I don’t expect that it will be necessary to move away from general LLM design principles.
saying “LLMs are not capable of solving ARC, therefore they are less intelligent than children” is equivalent to saying “humans can’t take the square root of 819381293787, therefore they are less intelligent than a calculator”.
Of course, I acknowledge that LLMs are better at many tasks than children. Those tasks just happen to all be within the LLM’s training data distribution, not outside of it. So, no, you wouldn’t say the calculator is more intelligent than the child, but you might say that it has an internal program that allows it to be faster and more accurate than a child. LLMs have such programs too, which they can use via pattern-matching, as long as the task falls within the training data distribution (in the case of the Caesar cipher, the model apparently doesn’t do so well for the number nine, because that case is simply less common in its training data).
One thing Chollet does mention that helps alleviate this limitation of deep learning is some form of active inference:
Dwarkesh: Jack Cole with a 240 million parameter model got 35% [on ARC]. Doesn’t that suggest that they’re on this spectrum that clearly exists within humans, and they’re going to be saturated pretty soon?
[...]
Chollet: One thing that’s really critical to making the model work at all is test time fine-tuning. By the way, that’s something that’s really missing from LLM approaches right now. Most of the time when you’re using an LLM, it’s just doing static inference. The model is frozen. You’re just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand.
What Jack Cole is actually doing is that for every test problem, it’s on-the-fly fine-tuning a version of the LLM for that task. That’s really what’s unlocking performance. If you don’t do that, you get like 1-2%, something completely negligible. If you do test time fine-tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers.
What it’s doing is trying to address one of the key limitations of LLMs today: the lack of active inference. It’s actually adding active inference to LLMs. That’s working extremely well, actually. So that’s fascinating to me.
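To make the contrast Chollet draws a bit more concrete, here is a toy sketch of static inference versus test-time fine-tuning. This is my own illustration, not Jack Cole’s actual method: the “model” is a single hypothetical weight, `FROZEN_W` and the demonstration pairs are made up, and a real setup would fine-tune a copy of an LLM on each ARC task’s demonstration grids.

```python
def predict(w, x):
    """Toy one-parameter model: y = w * x."""
    return w * x

def fine_tune(w, demos, lr=0.05, steps=200):
    """Gradient descent on mean squared error over the task's demonstrations.

    Works on a local copy of w, so the frozen weight is left untouched.
    """
    for _ in range(steps):
        grad = sum(2 * (predict(w, x) - y) * x for x, y in demos) / len(demos)
        w -= lr * grad
    return w

# Hypothetical "pretrained" weight; it never saw a task like this one.
FROZEN_W = 1.0

# A new task given only by demonstrations (the hidden rule is y = 3 * x).
demos = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
x_test = 4.0

# Static inference: the frozen model answers directly.
static_answer = predict(FROZEN_W, x_test)

# Test-time fine-tuning: adapt a per-task copy on the demonstrations, then answer.
tuned_answer = predict(fine_tune(FROZEN_W, demos), x_test)

print(static_answer, tuned_answer)
```

The frozen model answers 4.0, while the per-task copy ends up very close to 12. The point is only that the parameters change per test problem and are thrown away afterwards; nothing here says how far this scales.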
you might say that it has an internal program that allows it to be faster and more accurate than a child
My point is that children can solve ARC not because they have some amazing abstract spherical-in-vacuum reasoning abilities which LLMs lack, but because they have human-specific pattern-recognition abilities (for geometric shapes, number sequences, music, etc.). Brains have strong inductive biases, after all. If you train a model purely on predicting a non-anthropogenic physical environment, I think this model will struggle with ARC even if it has a sophisticated multi-level physical model of reality, because regular ARC-style repeating shapes are not very probable on priors.
My impression is that, in debates about ARC, AI people do not demonstrate a very high level of deliberation. Chollet and those who agree with him are like “nah, LLMs are nothing impressive, just interpolation databases!” and LLM enthusiasts are like “scaling will solve everything!!!!111!” Not many people seem to consider “something interesting is going on here. Maybe we can learn something important about how humans and LLMs work that doesn’t fit into simple explanation templates.”
One thing that’s really critical to making the model work at all is test time fine-tuning. By the way, that’s something that’s really missing from LLM approaches right now
Since AFAIK in-context learning functions pretty similarly to fine-tuning (though I haven’t looked into this much), it’s not clear to me why Chollet sees online fine-tuning as deeply different from few-shot prompting. Certainly few-shot prompting works extremely well for many tasks; maybe it just empirically doesn’t help much on this one?
Let’s start with the end:
Why do you think that they don’t already do that?
The claim that in-context learning functions similarly to fine-tuning is as per “Transformers learn in-context by gradient descent”, which Gwern also mentions in the comment that @quetzal_rainbow links here.
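For what it’s worth, here is a small numerical sketch of the kind of equivalence I understand that paper to be about. It is a simplified toy, not the paper’s full construction: for linear regression starting from zero weights, one gradient descent step on the in-context examples gives the same prediction as an unnormalized linear-attention readout with keys `x_i`, values `y_i`, and the query as the test input. All names and the data below are made up for illustration.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gd_one_step_prediction(xs, ys, x_query, lr):
    """Start from w = 0, take one GD step on (1/2n) * sum((w.x - y)^2), then predict."""
    n, dim = len(xs), len(x_query)
    w = [0.0] * dim
    grad = [0.0] * dim
    for x, y in zip(xs, ys):
        err = dot(w, x) - y
        for j in range(dim):
            grad[j] += err * x[j] / n
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Linear attention without softmax: sum of values y_i weighted by x_i . x_query."""
    n = len(xs)
    return (lr / n) * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

random.seed(0)
true_w = [0.5, -1.0, 2.0]  # hypothetical hidden rule behind the examples
xs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)]
ys = [dot(true_w, x) for x in xs]
x_query = [random.gauss(0, 1) for _ in range(3)]

gd_pred = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(gd_pred, attn_pred)
```

The two numbers agree up to floating-point error. That only shows the two mechanisms can coincide in this idealized linear case; it doesn’t settle whether few-shot prompting in a real LLM is as effective as actual test-time fine-tuning on ARC.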