The key question is "how much of the performance is due to ARC-AGI data?"
If the untuned o3 was anywhere near as good as the tuned o3, why didn't they test it and publish the result? When the most important and interesting test result is somehow omitted, take things with a grain of salt.
I admit that running the test is extremely expensive, but there are compromises available, such as running the cheaper configuration or attempting only a few questions.
Edit: oh, that reply seems to deny reinforcement learning, or at least "fine-tuning." I don't understand why François Chollet calls the model "tuned," then. Maybe wait for more information, I guess.
Edit again: I'm still not sure. They might be denying that it's a separate version of o3 fine-tuned on ARC questions, while not denying that they did reinforcement learning on the ARC public training set. I guess in a week or so we might find out what "tuned" truly means.
See my other comment instead.