I think this should only non-trivially update you if you had a specific belief like: “Transformers can’t do actual general learning and reasoning. For instance, they can’t do hard ARC-AGI problems (a reasonably central example of a task that requires learning and reasoning) at all.”
(I’m pretty sympathetic to a view like “ARC-AGI isn’t at all representative of hard cognitive tasks and doesn’t really highlight an interesting bottleneck in transformer ability more so than coding/agency benchmarks. Thus, I don’t really update much on any result.”)
In other words, the result is less that GPT-4o is able to achieve 50% on ARC-AGI. It is that a human familiar with the style of question used in ARC-AGI can devise a method for getting 50% on ARC-AGI that offloads some of the workload to GPT-4o.
There has been a bunch of discussion on this sort of point here, here, and on the substack version of the post. So, you might be interested in that discussion.
I think applying your same labeling approach would be considered pretty misleading in the context of human organizations or human education. Just because I build a structure around the model, give the model some examples of solving the problem, and let the model try many times doesn’t mean that the cognitive work is done by me!
(Most of my work is in building good representations to make up for vision issues and providing few-shot examples. I do a bunch of other tweaks which modestly improve performance. Of course, we also select which program to use based on what passes the tests, but this didn’t really involve me devising a method!)
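For concreteness, the selection step mentioned above can be sketched roughly as follows. This is a minimal illustration and not the actual pipeline: the names `passes_all` and `select_program` are hypothetical, and candidates are modeled as plain Python callables rather than model-generated source code.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


def passes_all(program: Program, examples: Sequence[Tuple[Grid, Grid]]) -> bool:
    """Return True if the program reproduces every example output exactly."""
    for inp, expected in examples:
        try:
            if program(inp) != expected:
                return False
        except Exception:
            # A candidate that crashes on an example counts as failing it.
            return False
    return True


def select_program(
    candidates: Sequence[Program], examples: Sequence[Tuple[Grid, Grid]]
) -> Optional[Program]:
    """Pick the first candidate consistent with all examples (assumed rule)."""
    for program in candidates:
        if passes_all(program, examples):
            return program
    return None


# Toy candidates standing in for sampled programs (all hypothetical).
def flip_rows(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def broken(grid: Grid) -> Grid:
    raise ValueError("simulated bad sample")


examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
chosen = select_program([broken, transpose, flip_rows], examples)
```

The filter itself is purely mechanical: it checks sampled programs against the given examples and does not solve anything on its own.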
Of course, this is a word choice question and if you also think “humans can’t do mechanical engineering and aren’t really doing the cognitive work of mechanical engineering, only humans+schools+CAD software can do mechanical engineering effectively”, that would be consistent. (But this seems like a very non-traditional use of terms!)