There is a secondary leaderboard where one can demonstrate things with closed-source API-only LLMs:
https://arcprize.org/arc-agi-pub
This uses a different evaluation set and is not eligible for the prize, so it is just for name recognition. Measuring is therefore possible, but would people be incentivized enough?
The main leaderboard, which uses the secret evaluation set, requires a “no internet connection” run (this is really the only way to make sure the secret evaluation set does not get disclosed).
You can see the tops of both leaderboards here (note the advantage of unmodified Claude 3.5 Sonnet over the other frontier models; this is just a one-shot run without any particular prompt manipulations):
https://arcprize.org/leaderboard
Ryan Greenblatt is currently the leader on the secondary leaderboard:
https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o
https://github.com/rgreenblatt/arc_draw_more_samples_pub
https://www.kaggle.com/code/rgreenblatt/rg-basic-ported-submission?scriptVersionId=184981551
But other than that, the concerns are valid as far as runs eligible for the prize are concerned. The rules are really hazy on whether using open-weights (but not open-source) models is OK:
https://www.kaggle.com/competitions/arc-prize-2024/rules
And it is not quite clear how large a model can be and still fit into their “no internet access enabled” run; they don’t really say:
https://www.kaggle.com/competitions/arc-prize-2024/overview
I was focusing on runs eligible for the prize in this short linkpost.
Right.
I was unhappy about their restrictions as well, until I understood that there seems to be no way to preserve the secrecy of the test set (that is, the integrity of the benchmark) if one sends test tasks out to the APIs of commercial LLMs.
(I can imagine a very sophisticated collaboration with sequestered versions of the closed-source models on special servers, if their makers and Chollet were interested in organizing something that complicated. But this would cost a lot, not just in equipment but in tons of labor, and in particular a lot of work making sure that the information on both sides is actually secure. So I understand why they just push all that to a parallel unofficial track with a different, “semi-secret” test set. People with access to OpenAI, Anthropic, and DeepMind logs are likely able to find that “semi-secret” test set there now, since some form of those runs has already taken place at all three of those orgs, so there is no guarantee of no cheating, although one would hope people would refrain from actually cheating.)