Okay, hot take: I don’t think that ARC tests “system 2 reasoning” and “solving novel tasks”, at least not in humans. When I see a simple task, I literally match patterns; when I see a complex task, I run whatever patterns I can invent until they match. I didn’t run the entire ARC test set, but if I am good at solving it, it will be because I am a fan of Myst-esque games and, actually, there are not that many possible principles for designing problems of this sort.
What the failure of LLMs to solve ARC is actually telling us is “LLM cognition is very different from human cognition”.
> I didn’t run the entire ARC test set, but if I am good at solving it, it will be because I am a fan of Myst-esque games and, actually, there are not that many possible principles for designing problems of this sort.
They’ve tested ARC with children and Mechanical Turk workers, and they all seem to do fine despite the average person not being a fan of “Myst-esque games.”
> What the failure of LLMs to solve ARC is actually telling us is “LLM cognition is very different from human cognition”.
Do you believe LLMs are just a few OOMs away from solving novel tasks like ARC? What is different that is not explained by what Chollet is saying?
By “good at solving” I mean “better than the average person”.
I think the fact that language models are better at predicting the next token than humans implies that LLMs have sophisticated text-oriented cognition, and saying “LLMs are not capable of solving ARC, therefore they are less intelligent than children” is equivalent to saying “humans can’t take the square root of 819381293787, therefore they are less intelligent than a calculator”.
My guess is that we would probably need to do something non-trivial to scale LLMs to superintelligence, but I don’t expect it to be necessary to move away from general LLM design principles.
> saying “LLMs are not capable of solving ARC, therefore they are less intelligent than children” is equivalent to saying “humans can’t take the square root of 819381293787, therefore they are less intelligent than a calculator”.
Of course, I acknowledge that LLMs are better at many tasks than children. Those tasks just happen to all be within their training data distribution, not outside of it. So, no, you wouldn’t say the calculator is more intelligent than the child, but you might say that it has an internal program that allows it to be faster and more accurate than a child. LLMs have such programs they can use via pattern-matching too, as long as the task falls within the training data distribution (in the case of the Caesar cipher, apparently they don’t do so well for the number nine, because it’s simply less common in the training data).
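For readers unfamiliar with it, the Caesar cipher mentioned above is just a fixed-offset letter substitution. A minimal sketch (purely illustrative; the shift-frequency claim is the comment’s empirical observation, not something this code tests):

```python
def caesar(text, shift):
    """Shift each letter by a fixed offset, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave digits, spaces, punctuation untouched
    return "".join(out)

# ROT13 (shift 13) is everywhere in web text; shift 9 is comparatively rare,
# which is the suggested reason models handle it worse.
print(caesar("hello", 13))  # uryyb
print(caesar("hello", 9))   # qnuux
```

The algorithm is identical for every shift value, so a system that executes the algorithm should be shift-agnostic; a system that pattern-matches on seen examples would not be.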
One thing that Chollet does mention that helps alleviate the limitations of deep learning is to have some form of active inference:
> Dwarkesh: Jack Cole with a 240 million parameter model got 35% [on ARC]. Doesn’t that suggest that they’re on this spectrum that clearly exists within humans, and they’re going to be saturated pretty soon?
> [...]
> Chollet: One thing that’s really critical to making the model work at all is test time fine-tuning. By the way, that’s something that’s really missing from LLM approaches right now. Most of the time when you’re using an LLM, it’s just doing static inference. The model is frozen. You’re just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand.
> What Jack Cole is actually doing is that for every test problem, it’s on-the-fly fine-tuning a version of the LLM for that task. That’s really what’s unlocking performance. If you don’t do that, you get like 1-2%, something completely negligible. If you do test time fine-tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers.
> What it’s doing is trying to address one of the key limitations of LLMs today: the lack of active inference. It’s actually adding active inference to LLMs. That’s working extremely well, actually. So that’s fascinating to me.
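To make the structure Chollet describes concrete, here is a toy sketch of a per-task adaptation loop: fit something on each task’s demonstration pairs at test time, then predict that task’s test output. The “model” below is just a color substitution table, not an LLM; it illustrates the loop shape only, not Jack Cole’s actual method.

```python
def fit_color_map(demos):
    """'Fine-tune' on a task's demonstration pairs: learn a per-cell
    color substitution from (input_grid, output_grid) examples."""
    mapping = {}
    for inp, out in demos:
        for row_in, row_out in zip(inp, out):
            for a, b in zip(row_in, row_out):
                mapping[a] = b
    return mapping

def predict(grid, mapping):
    """Apply the learned substitution, leaving unknown colors unchanged."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def solve_task(task):
    # The key point: parameters are fit per task, on the fly, instead of
    # being frozen across all tasks (static inference).
    mapping = fit_color_map(task["demos"])
    return predict(task["test_input"], mapping)

task = {
    "demos": [([[1, 2], [2, 1]], [[3, 4], [4, 3]])],  # 1->3, 2->4
    "test_input": [[2, 2], [1, 1]],
}
print(solve_task(task))  # [[4, 4], [3, 3]]
```

A frozen model would have to carry every such mapping in its weights or its prompt; the adapted one learns it from the two grids in front of it.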
> you might say that it has an internal program that allows it to be faster and more accurate than a child
My point is that children can solve ARC not because they have some amazing abstract spherical-in-vacuum reasoning abilities which LLMs lack, but because they have human-specific pattern recognition abilities (for geometric shapes, number sequences, music, etc.). Brains have strong inductive biases, after all. If you train a model purely on prediction of a non-anthropogenic physical environment, I think this model will struggle with ARC even if it has a sophisticated multi-level physical model of reality, because regular ARC-style repeating shapes are not very probable on priors.
My impression is that in debates about ARC, AI people do not demonstrate a very high level of deliberation. Chollet and those who agree with him are like “nah, LLMs are nothing impressive, just interpolation databases!” and LLM enthusiasts are like “scaling will solve everything!!!!111!” Not many people seem to consider “something interesting is going on here. Maybe we can learn something important about how humans and LLMs work that doesn’t fit into simple explanation templates.”
> One thing that’s really critical to making the model work at all is test time fine-tuning. By the way, that’s something that’s really missing from LLM approaches right now.
Since AFAIK in-context learning functions pretty similarly to fine-tuning (though I haven’t looked into this much), it’s not clear to me why Chollet sees online fine-tuning as deeply different from few-shot prompting. Certainly few-shot prompting works extremely well for many tasks; maybe it just empirically doesn’t help much on this one?
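For what it’s worth, the mechanical difference is just where the demonstrations go: in-context learning packs the demo pairs into the prompt of a frozen model, while test-time fine-tuning turns the same pairs into gradient updates. A sketch of the prompt-side option (the serialization format here is made up for illustration):

```python
def grid_to_text(grid):
    """Serialize a grid of ints into space- and newline-separated tokens."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def few_shot_prompt(demos, test_input):
    """Pack a task's demonstration pairs plus the test input into one
    prompt for a frozen model; the 'learning' happens in context."""
    parts = []
    for inp, out in demos:
        parts.append(f"Input:\n{grid_to_text(inp)}\nOutput:\n{grid_to_text(out)}")
    parts.append(f"Input:\n{grid_to_text(test_input)}\nOutput:")
    return "\n\n".join(parts)

demo = ([[1, 2], [2, 1]], [[2, 1], [1, 2]])
print(few_shot_prompt([demo], [[1, 1], [2, 2]]))
```

If in-context learning really does approximate gradient descent, the choice between the two should matter less than Chollet implies, which is the puzzle raised above.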
Looking at how gpt-4 did on the benchmark when I gave it some screenshots, the thing it failed at was the visual “pattern matching” (things completely solved by my system 1) rather than the abstract reasoning.
Yes, the point is that it can’t pattern match because it has never seen such examples. And, as humans, we are able to do well on the task because we don’t simply rely on pattern matching, we use system 2 reasoning (in addition) to do well on such a novel task. Given that the deep learning model relies on pattern matching, it can’t do the task.
As Chollet says in the podcast, we will see whether multimodal models crack ARC in the next year, but I think researchers should start paying attention, rather than dismissing the benchmark, if models are still incapable of doing so a year from now.
But for now, “LLMs do fine with processing ARC-like data by simply fine-tuning an LLM on subsets of the task and then testing it on small variations.” It encodes solution programs just fine for tasks it has seen before. So it doesn’t seem to be an issue of parsing the input or figuring out the program. For ARC, you need to synthesize a new solution program on the fly for each new task.
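A toy illustration of what “synthesize a new solution program on the fly” could mean: search a space of transformations for one consistent with the task’s demonstration pairs, then apply it to the test input. The primitive set below is invented for the example and is far smaller than anything a real ARC solver would use.

```python
def flip_h(g):    return [row[::-1] for row in g]   # mirror left-right
def flip_v(g):    return g[::-1]                     # mirror top-bottom
def transpose(g): return [list(r) for r in zip(*g)]  # swap rows/columns
def identity(g):  return [list(r) for r in g]

PRIMITIVES = [identity, flip_h, flip_v, transpose]

def synthesize(demos):
    """Return the first primitive consistent with every demo pair,
    or None if no single primitive fits (a real solver would compose them)."""
    for op in PRIMITIVES:
        if all(op(inp) == out for inp, out in demos):
            return op
    return None

demos = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]  # consistent with flip_h
op = synthesize(demos)
print(op([[2, 3], [4, 5]]))  # [[3, 2], [5, 4]]
```

The point of the contrast: a fine-tuned model retrieves a program it already encodes, while ARC asks for this kind of search over programs it has never needed before.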
Would it change your mind if gpt-4 were able to do the grid tasks when I manually transcribed them into different tokens? I tried manually having gpt-4 turn the image into a Python array, but it indeed has trouble performing even that task alone.
For concreteness: in this task it fails to recognize that all of the cells get filled, not only the largest one. To me that gives the impression that the image is just not getting compressed well, while the reasoning gpt-4 is doing is just fine.
Let’s start with the end:

> One thing that Chollet does mention that helps alleviate the limitations of deep learning is to have some form of active inference

Why do you think that they don’t already do that?

> Since AFAIK in-context learning functions pretty similarly to fine-tuning (though I haven’t looked into this much)

As per “Transformers learn in-context by gradient descent”, which Gwern also mentions in the comment that @quetzal_rainbow links here.

> the point is that it can’t pattern match because it has never seen such examples

I think humans just have a better visual cortex and expect this benchmark too to just fall with scale.