I was also unhappy about their restrictions until I understood that there seems to be no way to preserve the secrecy of the test set (that is, the integrity of the benchmark) if one sends test tasks out to the APIs of commercial LLMs.
(I can imagine a very sophisticated collaboration with sequestered versions of the closed-source models on special servers, if their makers and Chollet were interested in organizing something that complicated. But this would cost a lot: not just equipment but tons of labor, and in particular a lot of work making sure that both sides' information is actually secure. So I understand why they just push all that to a parallel unofficial track with a different, “semi-secret” test set. People with access to OpenAI, Anthropic, and DeepMind logs are likely to be able to find that “semi-secret” test set there by now, since some form of those runs has already taken place at all three of those orgs. So there is no guarantee of no cheating, although one would hope people would refrain from actually cheating.)
I was focusing on runs eligible for the prize in this short linkpost.
Right.