OpenAI shared that they trained the o3 we tested on 75% of the ARC-AGI Public Training set.
This was probably a dataset for RL: that is, the model was trained to try again and again to solve these tests with long chains of reasoning, not merely fine-tuned or pretrained on them. A detail like "75% of examples" sounds like a test-centric dataset design decision, with the other 25% going to the validation part of the dataset.
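For intuition, here is a minimal sketch of what such a 75/25 split over the public ARC-AGI tasks might look like. The repository path, seed, and file layout are assumptions for illustration only; this says nothing about OpenAI's actual pipeline.

```python
import json
import random
from pathlib import Path

# Assumption: the public ARC-AGI training tasks are JSON files in this directory
# (as in the fchollet/ARC-AGI repo layout); ~400 tasks in total.
tasks = sorted(Path("ARC-AGI/data/training").glob("*.json"))

random.seed(0)          # arbitrary seed, just to make the sketch deterministic
random.shuffle(tasks)

split = int(0.75 * len(tasks))   # 75% of ~400 tasks -> ~300
train_tasks = tasks[:split]      # hypothetically used for RL training
val_tasks = tasks[split:]        # hypothetically held out for validation

print(len(train_tasks), len(val_tasks))

# Each ARC task file holds "train" demonstration pairs and "test" pairs:
if train_tasks:
    example = json.loads(train_tasks[0].read_text())
    print(list(example.keys()))  # ['train', 'test']
```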
Altman: “didn’t go do specific work … just the general effort”
It seems plausible they trained on ALL such tests, specifically targeting various benchmarks, and the public part of ARC-AGI is "just" one piece of that dataset of all the tests. That could partly explain the o1/o3 difference at the $20 tier.