Using calculators and scripts is disqualifying on many competitive maths exams, so results obtained this way wouldn’t count (this was specified some years back). However, it’s an interesting paper worth checking out.
Is it clear these results don’t count? I see nothing in the Metaculus question text that rules it out.
It was specified at the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113

In your Metaculus question you may not have added that restriction. I think the question is much less interesting/informative without it. The questions were designed assuming no calculator access. It’s well known that many AIME problems are dramatically easier with a powerful calculator: since every AIME answer is an integer from 0 to 999, one can bash all 1000 options and find the number that works. That’s no longer testing problem-solving ability; it tests the ability to set up a simple script, so it loses nearly all the signal. Separately, the human results we collected were gathered under a no-calculator restriction, and the AMC/AIME exams themselves have a no-calculator rule. There are other maths competitions that allow calculators, but there are substantially fewer quality questions of that sort.
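To make the “bash all 1000 options” point concrete, here is a minimal sketch of that kind of brute force. The problem below is hypothetical (invented for illustration, not taken from an actual AIME); only the answer format, a single integer from 0 to 999, matches the real exam.

```python
# Hypothetical AIME-style question (invented for illustration):
#
#   "Find the odd three-digit integer N whose square ends in the same
#    three digits as N itself."
#
# With script access, the problem-solving content largely disappears:
# just test every candidate answer against the stated conditions.

def satisfies(n: int) -> bool:
    """Check the conditions transcribed directly from the problem statement."""
    is_three_digit = 100 <= n <= 999
    is_odd = n % 2 == 1
    square_ends_in_n = (n * n) % 1000 == n
    return is_three_digit and is_odd and square_ends_in_n

print([n for n in range(1000) if satisfies(n)])  # -> [625]
```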
I think MMLU + calculator is fine, though, since many of the exams MMLU draws from allow calculators.
I think it’s better if calculator use counts, given the ultimate purpose of the benchmark. We can’t ban AI models from using symbolic logic as an alignment strategy.
The purpose here is to test and forecast problem-solving ability, using problems that lose much of their informativeness when executable Python scripts are available. I don’t think this restriction is an ideological statement about which alignment strategies we want.
That doesn’t make much sense to me. Here the calculator is running on the same hardware as the model, and in principle a transformer could contain a sub-model that emulates a calculator.
I think there’s a clear enough distinction between transformers with and without tools. The human brain can also be viewed as a computational machine, but when exams say “no calculators,” they’re not banning mental calculation, just the use of specific external tools.