It’s important to remember that o3’s score on ARC-AGI is “tuned” while previous AIs’ scores are not “tuned.” Being explicitly trained on example test questions gives it a major advantage.
According to François Chollet (ARC-AGI designer):
Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
It’s interesting that OpenAI did not test how well o3 would have done before it was “tuned.”
EDIT: People at OpenAI deny “fine-tuning” o3 for the ARC (see this comment by Zach Stein-Perlman). But to me, the denials sound like “we didn’t use a separate derivative of o3 (that’s fine-tuned for just the test) to take the test, but we may have still done reinforcement learning on the public training set.” (See my reply)