I think it’s better if calculators are counted for the ultimate purpose of the benchmark. We can’t ban AI models from using symbolic logic as an alignment strategy.
The purpose of this is to test and forecast problem-solving ability, using examples that substantially lose informativeness in the presence of Python executable scripts. I think this restriction isn’t an ideological statement about what sort of alignment strategies we want.
I think it’s better if calculators are counted for the ultimate purpose of the benchmark. We can’t ban AI models from using symbolic logic as an alignment strategy.
The purpose of this is to test and forecast problem-solving ability, using examples that substantially lose informativeness in the presence of Python executable scripts. I think this restriction isn’t an ideological statement about what sort of alignment strategies we want.