I encourage alignment/safety people to be open-minded about what François Chollet is saying in this podcast:
I think many have bought blindly into the ‘scale is all you need’ view and the apparently godly nature of LLMs, and may be relying on unfounded or confused assumptions as a result.
Getting this right is important because it could significantly impact how hard you think alignment will be. Here’s @johnswentworth responding to @Eliezer Yudkowsky about his difference in optimism compared to @Quintin Pope (despite believing the natural abstraction hypothesis is true):
Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI’s internal ontology at training time. My guess is that more of the disagreement lies here.
I doubt much disagreement between you and I lies there, because I do not expect ML-style training to robustly point an AI in any builder-intended direction. My hopes generally don’t route through targeting via ML-style training.
I do think my deltas from many other people lie there—e.g. that’s why I’m nowhere near as optimistic as Quintin—so that’s also where I’d expect much of your disagreement with those other people to lie.
Chollet’s key points are that LLMs and current deep learning approaches rely heavily on memorization and pattern matching rather than reasoning and problem-solving abilities. Humans (who are ‘intelligent’) use a bit of both.
LLMs have failed at ARC for the last 4 years because they are simply not intelligent: they basically pattern-match and interpolate to whatever is within their training distribution. You can say, “Well, there’s no difference between interpolation and extrapolation once you have a big enough model trained on enough data,” but the point remains that LLMs fail at the Abstraction and Reasoning Corpus precisely because they have never seen such examples.
No matter how ‘smart’ GPT-4 may be, it fails at simple ARC tasks that a human child can do. The child does not need to be fed thousands of ARC-like examples; it can just generalize and adapt to solve the novel problem.
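For readers who haven’t looked at the benchmark: my understanding is that ARC tasks are distributed as JSON files containing a handful of demonstration input/output grid pairs plus one or more test inputs, where each grid is a small array of integers 0–9 standing for colours. A toy task in that spirit might look like the sketch below; the “duplicate the last column” rule and the specific grids are my own invention for illustration, not an actual benchmark task.

```python
# Illustrative, simplified ARC-style task (not a real task from the benchmark).
# Actual ARC tasks are JSON files with "train" and "test" lists of input/output
# grids, where each grid is a list of rows of integers 0-9 (colours).
task = {
    "train": [
        # hypothetical rule: duplicate the last column of each row
        {"input": [[0, 1], [0, 0]], "output": [[0, 1, 1], [0, 0, 0]]},
        {"input": [[2, 0], [0, 3]], "output": [[2, 0, 0], [0, 3, 3]]},
    ],
    "test": [
        {"input": [[5, 0], [0, 7]]}  # a solver must produce [[5, 0, 0], [0, 7, 7]]
    ],
}
```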
One thing to ponder here is, “Are the important kinds of tasks we care about, e.g. coming up with novel physics or inventing nanotech robots, reliant on models gaining the key ability to think and act in the way it would need to solve a novel task like ARC?”
At this point, I think some people might say something like, “How can you say that LLMs are not intelligent when they are solving novel software engineering problems or coming up with new poems?”
Well, again, the answer here is that they have so many examples like this in their pre-training that they can pattern-match their way to something like writing the code for a specification a human came up with. I think if there were a really in-depth study, people might be surprised by how much they had assumed LLMs could generalize to genuinely novel tasks. And I think it’s worth considering whether this is a key bottleneck in current approaches.
Then again, the follow-up to this is likely, “OK, but we’ll likely easily solve system 2 reasoning very soon; it doesn’t seem like much of a challenge once you scale up the model to a certain level of ‘capability.’”
We’ll see! I mean, so far, current models have failed to make things like Auto-GPT work. Maybe this just requires more trial-and-error, or maybe this is a big limitation of current systems, and you actually do need some “transformer-level paradigm shift”. Here’s @johnswentworth again regarding his hope:
There isn’t really one specific thing, since we don’t yet know what the next ML/AI paradigm will look like, other than that some kind of neural net will probably be involved somehow. (My median expectation is that we’re ~1 transformers-level paradigm shift away from the things which take off.) But as a relatively-legible example of a targeting technique my hopes might route through: Retargeting The Search.
Lastly, you might think, “well it seems pretty obvious that we can just automate many jobs because most jobs have a static distribution”, but 1) you are still limited by human data to provide examples for learning and to add patterns to the training distribution, and 2) as Chollet says:
We can automate more and more things. Yes, this is economically valuable. Yes, potentially there are many jobs you could automate away like this. That would be economically valuable. You’re still not going to have intelligence.
So you can ask, what does it matter if we can generate all this economic value? Maybe we don’t need intelligence after all. You need intelligence the moment you have to deal with change, novelty, and uncertainty.
As long as you’re in a space that can be exactly described in advance, you can just rely on pure memorization. In fact, you can always solve any problem. You can always display arbitrary levels of skills on any task without leveraging any intelligence whatsoever, as long as it is possible to describe the problem and its solution very, very precisely.
So, how does Chollet expect us to resolve the issues with deep learning systems?
Let’s break it down:
Deep learning excels at memorization, pattern matching, and intuition—similar to system 1 thinking in humans. It generalizes, but only locally, within its training data distribution.
One way to get good system 2 reasoning out of your system is to use something like discrete program search (Chollet’s current favourite approach) because it excels at systematic reasoning, planning, and problem-solving (which is why it can do well on ARC). It can synthesize programs to solve novel problems but suffers from combinatorial explosion and is computationally inefficient.
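To make “discrete program search” and its combinatorial explosion concrete, here is a minimal sketch under assumptions of my own: a tiny made-up DSL of grid operations and blind enumeration of their compositions until one reproduces every demonstration pair. Nothing here is Chollet’s actual DSL or search code.

```python
from itertools import product

# A tiny, made-up DSL of grid transformations (illustrative only).
def flip_h(g):       return [row[::-1] for row in g]
def flip_v(g):       return g[::-1]
def transpose(g):    return [list(col) for col in zip(*g)]
def dup_last_col(g): return [row + [row[-1]] for row in g]

PRIMITIVES = [flip_h, flip_v, transpose, dup_last_col]

def run(program, grid):
    # A "program" here is just a tuple of primitives applied in order.
    for op in program:
        grid = op(grid)
    return grid

def search(train_pairs, max_depth=3):
    """Enumerate all compositions of primitives up to max_depth and return the
    first one that reproduces every demonstration pair. The candidate count is
    len(PRIMITIVES) ** depth, which is the combinatorial explosion Chollet
    mentions: realistic DSLs and depths make blind enumeration hopeless."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, p["input"]) == p["output"] for p in train_pairs):
                return program
    return None

# Usage with the toy task from earlier:
# program = search(task["train"])
# prediction = run(program, task["test"][0]["input"])
```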
Chollet suggests that the path forward involves combining the strengths of deep learning with the approach that gets us a good system 2. This is starting to sound similar to @johnswentworth’s intuition that future systems will include “some kind of neural net [...] somehow” (I mean, of course). But essentially, the outer structure would be a discrete program search that systematically explores the space of possible programs, aided by deep learning to guide the search in intelligent ways: providing intuitions about likely solution approaches, offering hints when the search gets stuck, pruning unproductive branches, and so on.
Basically, it leverages the vast knowledge and pattern recognition of deep learning (limited by its training data distribution), while gaining the ability to reason and adapt to novel situations via program synthesis. Going back to ARC, you’d expect it could solve such novel problems by decomposing them into familiar subproblems and systematically searching for solution programs.
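And a hedged sketch of what “deep learning guiding the search” could look like on top of the enumeration above: a scoring function (here a dummy heuristic standing in for a learned model) orders candidate extensions and prunes all but the most promising ones. The function names and the beam-style pruning are illustrative choices, not a description of any particular system; it reuses run() and the primitives from the previous sketch.

```python
import heapq
from itertools import count

def neural_prior(partial_program, train_pairs):
    """Stand-in for a learned model scoring how promising a partial program
    looks for these demonstrations (higher = more promising). A real system
    would use a neural net conditioned on the task; this dummy heuristic just
    prefers shorter programs so the sketch runs."""
    return -len(partial_program)

def guided_search(train_pairs, primitives, max_depth=3, beam_width=4):
    """Best-first search over programs, ordered by the (stand-in) learned prior.
    Reuses run() from the enumeration sketch above."""
    tie = count()  # tie-breaker so the heap never has to compare programs directly
    frontier = [(-neural_prior((), train_pairs), next(tie), ())]
    while frontier:
        _, _, program = heapq.heappop(frontier)
        if program and all(run(program, p["input"]) == p["output"] for p in train_pairs):
            return program
        if len(program) >= max_depth:
            continue
        # Expand with every primitive, but keep only the top-scoring extensions:
        # this is where deep learning would prune unproductive branches.
        candidates = sorted((program + (op,) for op in primitives),
                            key=lambda c: neural_prior(c, train_pairs), reverse=True)
        for cand in candidates[:beam_width]:
            heapq.heappush(frontier, (-neural_prior(cand, train_pairs), next(tie), cand))
    return None
```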
memorization and pattern matching rather than reasoning and problem-solving abilities
In my opinion, this does not correspond to a principled distinction at the level of computation.
For intelligences that employ consciousness in order to do some of these things, there may be a difference in terms of mechanism. Reasoning and pattern matching sound like they correspond to different kinds of conscious activity.
But if we’re just talking about computation… a syllogism can be implemented via pattern matching, a pattern can be completed by a logical process (possibly probabilistic).
But if we’re just talking about computation… a syllogism can be implemented via pattern matching, a pattern can be completed by a logical process (possibly probabilistic).
Perhaps, but deep learning models are still failing at ARC. My guess (and Chollet’s) is that they will continue to fail at ARC unless they are trained on that kind of data (which goes against the point of the benchmark) or you add something else that actually resolves this failure in deep learning models. They may be able to pattern-match to reasoning-like behaviour, but only if specifically trained on that kind of data. No matter how much you scale them up, they will still fail to generalize to anything not local to their training data distribution.
I think this is exactly right. The phrasing is a little confusing. I’d say “LLMs can’t solve truly novel problems”.
But the implication that this is a slow route or dead-end for AGI is wrong. I think it’s going to be pretty easy to scaffold LLMs into solving novel problems. I could be wrong, but don’t bet heavily on it unless you happen to know way more about cognitive psychology and LLMs in combination than I do. It would be foolish to make a plan for survival that relies on this being a major delay.
I can’t convince you of this without describing exactly how I think this will be relatively straightforward, and I’m not ready to advance capabilities in this direction yet. I think language model agents are probably our best shot at alignment, so we should probably actively work on advancing them to AGI; but I’m not sure enough yet to start publishing my best theories on how to do that.
Back to the possibly confusing phrasing Chollet uses: I think he’s using Piaget’s definition of intelligence as “what you do when you don’t know what to do” (he quotes this in the interview). That’s restricting it to solving problems you haven’t memorized an approach to. That’s not how most people use the word intelligence.
When he says LLMs “just memorize”, he’s including memorizing programs or approaches to problems, into which they can plug the variables of a particular variant of the problem. I think the question “well, maybe that’s all you need to do” raised by Patel is appropriate; it’s clear they can’t do enough of this yet, but it’s unclear whether further progress will get them to another level of abstraction: an approach so abstract and general that it can solve almost any problem.
I think he’s on the wrong track with the “discrete program search” because I see more human-like solutions that may be lower-hanging fruit, but I wouldn’t bet his approach won’t work. I’m starting to think that there are many approaches to general intelligence, and a lot of them just aren’t that hard. We’d like to think our intelligence is unique, magical, and special, but it’s looking like it’s not. Or at least, we should assume it’s not if we want to plan for other intelligences well enough to survive. So I think alignment workers should assume LLMs with or without scaffolding might very well pass this hurdle fairly quickly.
Okay, hot take: I don’t think that ARC tests “system 2 reasoning” or “solving novel tasks”, at least in humans. When I see a simple task, I literally match patterns; when I see a complex task, I run whatever patterns I can invent until they match. I haven’t run through the entire ARC test set, but if I am good at solving it, it will be because I am a fan of Myst-esque games and, actually, there are not that many possible principles for designing problems of this sort.
What the failure of LLMs to solve ARC is actually telling us is that “LLM cognition is very different from human cognition”.
I haven’t run through the entire ARC test set, but if I am good at solving it, it will be because I am a fan of Myst-esque games and, actually, there are not that many possible principles for designing problems of this sort.
They’ve tested ARC with children and Mechanical Turk workers, and they all seem to do fine despite the average person not being a fan of “Myst-esque games.”
What the failure of LLMs to solve ARC is actually telling us is that “LLM cognition is very different from human cognition”.
Do you believe LLMs are just a few OOMs away from solving novel tasks like ARC? What is different that is not explained by what Chollet is saying?
By “good at solving” I mean “better than average person”.
I think the fact that language models are better at predicting the next token than humans implies that LLMs have sophisticated text-oriented cognition, and saying “LLMs are not capable of solving ARC, therefore they are less intelligent than children” is equivalent to saying “humans can’t take the square root of 819381293787, therefore they are less intelligent than a calculator”.
My guess is that we would probably need to do something non-trivial to scale LLMs to superintelligence, but I don’t expect it to be necessary to move away from general LLM design principles.
saying “LLMs are not capable of solving ARC, therefore they are less intelligent than children” is equivalent to saying “humans can’t take the square root of 819381293787, therefore they are less intelligent than a calculator”.
Of course, I acknowledge that LLMs are better at many tasks than children. Those tasks just happen to all be within their training data distribution rather than outside of it. So, no, you wouldn’t say the calculator is more intelligent than the child, but you might say that it has an internal program that allows it to be faster and more accurate than a child. LLMs have such programs they can use via pattern-matching too, as long as the task falls into the training data distribution (in the case of the Caesar cypher, apparently they don’t do so well for the number nine, because it’s simply less common in their training data distribution).
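For concreteness on the Caesar-cipher point: the task itself is trivial to specify, and the claim (as I understand it from the podcast and surrounding discussion) is that models handle the ubiquitous ROT-13 shift far better than rarer shift values such as nine, even though the algorithm is identical. The snippet below just defines the task; it doesn’t demonstrate the failure.

```python
import string

def caesar_encrypt(text: str, shift: int) -> str:
    """Shift each letter forward by `shift` positions (the Caesar cipher)."""
    table = str.maketrans(
        string.ascii_lowercase + string.ascii_uppercase,
        string.ascii_lowercase[shift:] + string.ascii_lowercase[:shift]
        + string.ascii_uppercase[shift:] + string.ascii_uppercase[:shift],
    )
    return text.translate(table)

# ROT-13 (shift 13) is all over the internet, so an LLM has seen countless
# plaintext/ciphertext pairs for it; shift 9 is the same trivial algorithm
# but a much rarer pattern in the training data.
print(caesar_encrypt("hello world", 13))  # -> "uryyb jbeyq"
print(caesar_encrypt("hello world", 9))   # -> "qnuux fxaum"
```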
One thing that Chollet does mention that helps to alleviate the limitation of deep learning is to have some form of active inference:
Dwarkesh: Jack Cole with a 240 million parameter model got 35% [on ARC]. Doesn’t that suggest that they’re on this spectrum that clearly exists within humans, and they’re going to be saturated pretty soon?
[...]
Chollet: One thing that’s really critical to making the model work at all is test time fine-tuning. By the way, that’s something that’s really missing from LLM approaches right now. Most of the time when you’re using an LLM, it’s just doing static inference. The model is frozen. You’re just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand.
What Jack Cole is actually doing is that for every test problem, it’s on-the-fly fine-tuning a version of the LLM for that task. That’s really what’s unlocking performance. If you don’t do that, you get like 1-2%, something completely negligible. If you do test time fine-tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers.
What it’s doing is trying to address one of the key limitations of LLMs today: the lack of active inference. It’s actually adding active inference to LLMs. That’s working extremely well, actually. So that’s fascinating to me.
Let’s start with the end:

Why do you think that they don’t already do that?

you might say that it has an internal program that allows it to be faster and more accurate than a child
My point is that children can solve ARC not because they have some amazing abstract spherical-in-vacuum reasoning abilities which LLMs lack, but because they have human-specific pattern recognition abilities (for things like geometric shapes, number sequences, music, etc.). Brains have strong inductive biases, after all. If you train a model purely on prediction of a non-anthropogenic physical environment, I think this model will struggle with solving ARC even if it has a sophisticated multi-level physical model of reality, because regular ARC-style repeating shapes are not very probable on priors.
In my impression, in debates about ARC, AI people do not demonstrate a very high level of deliberation. Chollet and those who agree with him are like “nah, LLMs are nothing impressive, just interpolation databases!” and LLM enthusiasts are like “scaling will solve everything!!!!111!” Not many people seem to consider “something interesting is going on here. Maybe we can learn something important about how humans and LLMs work that doesn’t fit into simple explanation templates.”
One thing that’s really critical to making the model work at all is test time fine-tuning. By the way, that’s something that’s really missing from LLM approaches right now
Since AFAIK in-context learning functions pretty similarly to fine-tuning (though I haven’t looked into this much), it’s not clear to me why Chollet sees online fine-tuning as deeply different from few-shot prompting. Certainly few-shot prompting works extremely well for many tasks; maybe it just empirically doesn’t help much on this one?

As per “Transformers learn in-context by gradient descent”, which Gwern also mentions in the comment that @quetzal_rainbow links here.
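For readers skimming, the contrast being discussed is roughly the following; the generate/finetune callables are schematic placeholders of my own, not any real library’s API. Few-shot prompting leaves the weights frozen and supplies the demonstrations only through the context, while test-time fine-tuning takes a few gradient steps on the task’s own demonstration pairs before predicting. Whether these end up being deeply different is exactly the question raised above.

```python
# Schematic contrast between few-shot prompting and test-time fine-tuning.
# `generate` and `finetune` are placeholders for "run the model on a prompt"
# and "take gradient steps on (input, output) pairs" -- not a real library API.

def few_shot_prompting(model, demos, test_input, generate):
    """In-context learning: the weights stay frozen, and the demonstrations
    only enter through the prompt."""
    prompt = "".join(f"Input: {d['input']}\nOutput: {d['output']}\n" for d in demos)
    prompt += f"Input: {test_input}\nOutput:"
    return generate(model, prompt)

def test_time_finetuning(model, demos, test_input, finetune, generate, steps=32):
    """The approach described in the quoted passage: for each new task, take a
    few gradient steps on that task's own demonstration pairs, then predict,
    so the model's state actually adapts to the task at hand."""
    adapted = finetune(model, [(d["input"], d["output"]) for d in demos], steps=steps)
    return generate(adapted, f"Input: {test_input}\nOutput:")
```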
Looking at how gpt-4 did on the benchmark when I gave it some screenshots, the thing it failed at was the visual “pattern matching” (things completely solved by my system 1) rather than the abstract reasoning.
Yes, the point is that it can’t pattern match because it has never seen such examples. And, as humans, we are able to do well on the task because we don’t simply rely on pattern matching, we use system 2 reasoning (in addition) to do well on such a novel task. Given that the deep learning model relies on pattern matching, it can’t do the task.
I think humans just have a better visual cortex and expect this benchmark, too, to just fall with scale.

As Chollet says in the podcast, we will see if multimodal models crack ARC in the next year, but if they can’t do it within the next year, I think researchers should start paying attention rather than dismissing the benchmark.
But for now, “LLMs do fine with processing ARC-like data by simply fine-tuning an LLM on subsets of the task and then testing it on small variations.” It encodes solution programs just fine for tasks it has seen before. It doesn’t seem to be an issue of parsing the input or figuring out the program. For ARC, you need to synthesize a new solution program on the fly for each new task.
Would it change your mind if gpt-4 were able to do the grid tasks when I manually transcribed them into different tokens? I tried to manually have gpt-4 turn the image into a python array, but it indeed has trouble performing just that task alone.
For concreteness: in this task, it fails to recognize that all of the cells get filled, not only the largest one. To me, that gives the impression that the image is just not getting compressed well and that the reasoning gpt-4 is doing is fine.
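A minimal sketch of the kind of manual transcription being proposed, with an arbitrary illustrative encoding (the benchmark doesn’t mandate any particular text format as far as I know):

```python
def grid_to_text(grid):
    """Render a grid of colour indices as plain text, one row per line, so a
    text-only model can be prompted without relying on its vision input.
    (The exact encoding is an arbitrary choice for illustration.)"""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

print(grid_to_text([[0, 1, 1], [0, 0, 0]]))
# 0 1 1
# 0 0 0
```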
There are other interesting places where LLMs fail badly at reasoning, e.g. planning problems like blocks-world or scheduling meetings between people with availability constraints; see e.g. this paper and other work from Kambhampati.
I’ve been considering putting some time into this as a research direction; the ML community has a literature on the topic, but it doesn’t seem to have been discussed much in AIS, although the ARC prize could change that. I think it needs to be considered through a safety lens, since it has significant impacts on the plausibility of short timelines to a drop-in researcher like @leopold’s. I have an initial sketch of such a direction here, combining lit review and experimentation. Feedback welcomed!
(if in fact someone already has looked at this issue through an AIS lens, I’d love to know about it!)
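To make the planning failures mentioned above concrete, here is a minimal blocks-world instance with a plan validator; the representation is my own toy version, not the encoding used in the cited paper.

```python
# A minimal blocks-world instance of the kind LLMs are reported to struggle
# with in the planning literature. State: mapping block -> what it sits on
# ("table" or another block).
start = {"A": "table", "B": "A", "C": "B"}   # C on B on A on the table
goal  = {"A": "B", "B": "C", "C": "table"}   # the reversed stack

def clear(state, block):
    # A block is clear if nothing sits on top of it.
    return all(under != block for under in state.values())

def apply_move(state, block, dest):
    """Move `block` onto `dest` ('table' or another block); return the new
    state, or None if the move is illegal (block or destination not clear)."""
    if not clear(state, block):
        return None
    if dest != "table" and (dest == block or not clear(state, dest)):
        return None
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def valid_plan(state, plan, goal):
    for block, dest in plan:
        state = apply_move(state, block, dest)
        if state is None:
            return False
    return state == goal

# One correct plan for this instance:
plan = [("C", "table"), ("B", "C"), ("A", "B")]
print(valid_plan(start, plan, goal))  # True
```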
LLMs have failed at ARC for the last 4 years because they are simply not intelligent: they basically pattern-match and interpolate to whatever is within their training distribution. You can say, “Well, there’s no difference between interpolation and extrapolation once you have a big enough model trained on enough data,” but the point remains that LLMs fail at the Abstraction and Reasoning Corpus precisely because they have never seen such examples.
No matter how ‘smart’ GPT-4 may be, it fails at simple ARC tasks that a human child can do. The child does not need to be fed thousands of ARC-like examples; it can just generalize and adapt to solve the novel problem.
I don’t get it. I just looked at ARC, and it seemed obvious that gpt-4/gpt-4o can easily solve these problems by writing python. Then I looked it up on papers-with-code and it seems close to solved? Probably the ones remaining would be hard for children too. Did the benchmark leak into the training data, and is that why they don’t count them?
Unfortunate name collision: you’re looking at numbers on the AI2 Reasoning Challenge, not Chollet’s Abstraction & Reasoning Corpus.

Thanks for clarifying! I just tried a few simple ones by prompting gpt-4o and gpt-4, and it does an absolutely horrific job! Maybe actually good prompting could help solve it, but this is definitely already an update for me!