ARC-AGI is a genuine AGI test but o3 cheated :(

The developer of ARC-AGI says o3 is not AGI, and admits his test isn’t really an AGI test.[1] But I think it is an AGI test.

From first principles, AGI tests should be easy to make, because the only reason current AI isn’t AGI is that it fails at a lot of things human workers do on a daily basis.

I feel the ARC-AGI successfully captures these failings. When an AI encounters a problem where the solution isn’t in its training set, and where even the set of rules for how to solve such a problem isn’t in its training set, it has to figure out everything from first principles. Current AI fails miserably at these problems because it isn’t AGI.

The ARC-AGI questions are not in any LLM’s training set, because almost no human would write out the reasoning by which they solve these kinds of visual puzzles. Maybe some humans explain IQ tests in a video, but the video transcript would be useless without the accompanying images.

In summary, the reason that current AI cannot solve the ARC-AGI questions is probably the true reason that current AI cannot replace most human work:[2] it cannot effectively learn new tasks with little training data.

OpenAI’s o3 crushed the ARC-AGI. So will it also replace most human work, and be “AGI”? Maybe… except it “cheated,” in the sense that it trained on the test (instead of being tested on its ability to start from scratch).

From ARC Prize:

Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

They ask “how much of the performance is due to ARC-AGI data.” Probably most of it. If untuned o3 could do as well, don’t you think OpenAI would have published that (in addition to tuned o3)?

By training o3 on the public training set, OpenAI turned the ARC-AGI into something other than an AGI test. It becomes yet another test of applying rules memorized from training data. Passing it is still impressive, but it measures something else.

I do not know what exactly OpenAI did. Did they let o3 spend a long time generating chains of thought, and reward ones which led to the correct answer? If that failed, did they have a human give examples of correct reasoning steps, and train it on that first? I don’t know.
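To make the first guess concrete, here is a toy sketch of what “reward the chains of thought which led to the correct answer” could look like. Everything in it is hypothetical: the `sample_chain` stand-in for a model, the three candidate rules, and the single made-up puzzle are my own illustration, and none of it is a claim about what OpenAI actually did.

```python
import random

# Hypothetical toy "reasoning steps": candidate rules a model might try on a grid.
RULES = {
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: g[::-1],
    "identity": lambda g: [row[:] for row in g],
}

def sample_chain(rng):
    """Stand-in for a model sampling a chain of thought: picks a rule at random."""
    return rng.choice(sorted(RULES))

def outcome_reward(rule_name, puzzle):
    """Reward 1 if the chain's final answer matches the known output, else 0."""
    prediction = RULES[rule_name](puzzle["input"])
    return 1 if prediction == puzzle["output"] else 0

def collect_rewarded_chains(puzzle, n_samples=50, seed=0):
    """Sample many chains and keep only the ones that reached the correct answer."""
    rng = random.Random(seed)
    chains = [sample_chain(rng) for _ in range(n_samples)]
    return [c for c in chains if outcome_reward(c, puzzle) == 1]

# A made-up training puzzle whose hidden rule is a horizontal flip.
puzzle = {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]}
kept = collect_rewarded_chains(puzzle)
```

Only chains that picked the horizontal flip survive the filter. In a real training run, the surviving chains would be what the model gets reinforced on, which is how a public training set could end up teaching the model the rules rather than testing whether it can find them from scratch.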

They admitted they “cheated,” without saying how they “cheated.”

To be fair, this isn’t really cheating, in the sense that they are allowed to use the data; that is why it’s called a “public training set.” But the version of the test that allows this is not the AGI test.

Also to be fair, o3 is still a force to be reckoned with, and a reason to be both impressed and worried. I have no hard feelings against OpenAI, because almost all AI labs are doing this, all the time. If you don’t do this, you won’t survive in the toxic battlefield of AI lab competition. We just have to take AI lab demonstrations with a healthy dose of skepticism.

EDIT: People at OpenAI deny “fine-tuning” o3 for the ARC (see this comment by Zach Stein-Perlman). But to me, the denials sound like “we didn’t use a separate derivative of o3 (that’s fine-tuned for just the test) to take the test, but we may have still done reinforcement learning on the public training set.” (See my reply)

  1. This is according to New Scientist. The “developer of ARC-AGI” refers to François Chollet.

  2. I guess another reason AI cannot replace most human work is reliability. The low reliability prevents AI from replacing high-responsibility human labour, and the poor ability to learn new tasks prevents AI from replacing high-innovation human labour.