I think this experiment does not update me substantially towards thinking we are closer to AGI, because the experiment does not show GPT-4o coming up with a strategy to solve the task and then executing it. Rather, a human (a general intelligence) has looked at the benchmark and then devised an algorithm that will let GPT-4o perform well on the task.
Further, the method does not seem flexible enough to work on a diverse range of tasks and certainly not without human involvement in adapting it.
In other words, the result is less that GPT-4o is able to achieve 50% on ARC-AGI. It is that a human familiar with the style of question used in ARC-AGI can devise a method for getting 50% on ARC-AGI that offloads some of the workload to GPT-4o.
I think this should only non-trivially update you if you had a specific belief like: “Transformers can’t do actual general learning and reasoning. For instance, they can’t do hard ARC-AGI problems (a reasonably central example of a task that requires learning and reasoning) at all.”
(I’m pretty sympathetic to a view like “ARC-AGI isn’t at all representative of hard cognitive tasks and doesn’t really highlight an interesting bottleneck in transformer ability more so than coding/agency benchmarks. Thus, I don’t really update much on any result.”)
In other words, the result is less that GPT-4o is able to achieve 50% on ARC-AGI. It is that a human familiar with the style of question used in ARC-AGI can devise a method for getting 50% on ARC-AGI that offloads some of the workload to GPT-4o.
There has been a bunch of discussion on this sort of point here, here, and on the substack version of the post. So, you might be interested in that discussion.
I think applying your same labeling approach would be considered pretty misleading in the context of human organizations or human education. Just because I build a structure around the model, give the model some examples of solving the problem, and let the model try many times doesn’t mean that the cognitive work is done by me!
(Most of my work is in building good representations to make up for vision issues and providing few-shot examples. I do a bunch of other tweaks which modestly improve performance. Of course, we also select which program to use based on what passes the tests, but this didn’t really involve me devising a method!)
Of course, this is a word choice question and if you also think “humans can’t do mechanical engineering and aren’t really doing the cognitive work of mechanical engineering, only humans+schools+CAD software can do mechanical engineering effectively”, that would be consistent. (But this seems like a very non-traditional use of terms!)
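For readers unfamiliar with the selection step mentioned above (picking which program to use based on what passes the tests), here is a minimal sketch of the general idea. It is not the actual pipeline: the names `passes_all` and `select_program` and the toy candidate functions are hypothetical, and the candidates stand in for programs that a model would have written.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


def passes_all(program: Program, examples: List[Tuple[Grid, Grid]]) -> bool:
    """Return True if the program reproduces every example output."""
    for grid_in, grid_out in examples:
        try:
            if program(grid_in) != grid_out:
                return False
        except Exception:
            # A candidate that crashes counts as a failure.
            return False
    return True


def select_program(
    candidates: List[Program], examples: List[Tuple[Grid, Grid]]
) -> Optional[Program]:
    """Return the first candidate that passes all examples, or None."""
    for program in candidates:
        if passes_all(program, examples):
            return program
    return None


# Hypothetical candidates standing in for model-written programs.
def flip_horizontal(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def broken(grid: Grid) -> Grid:
    return grid[5]  # raises IndexError on small grids


# Toy input/output pairs (assumed for illustration): the hidden rule is a transpose.
examples = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[0, 5], [0, 0]], [[0, 0], [5, 0]]),
]

chosen = select_program([broken, flip_horizontal, transpose], examples)
```

The selection logic itself is a plain filter: it only checks candidates against the given examples and contains no task-specific knowledge. Whatever solves the task has to come from the candidates.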