It was specified at the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113
In your Metaculus question you may not have added that restriction. I think the question is much less interesting/informative without it. The questions were designed assuming there is no calculator access. It's well known that many AIME problems are dramatically easier with a powerful calculator, since one could bash all 1000 candidate answers and find the one that works for many problems. That no longer tests problem-solving ability; it tests the ability to set up a simple script, so it loses nearly all of the signal. Separately, the human results we collected were gathered under a no-calculator restriction. AMC/AIME exams have a no-calculator restriction. There are other maths competitions that allow calculators, but there are substantially fewer quality questions of that sort.
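To make the "bash 1000 options" point concrete, here is a minimal sketch; the problem condition is a made-up stand-in, not an actual AIME item:

```python
# AIME answers are integers in 0..999, so with script access a solver can
# simply enumerate every candidate and test the problem's condition.
# Hypothetical ask: the least positive n whose square ends in 256.

def satisfies_condition(n: int) -> bool:
    return n > 0 and (n * n) % 1000 == 256

answer = next(n for n in range(1000) if satisfies_condition(n))
print(answer)  # found by exhaustive search, with no number-theoretic insight
```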
I think MMLU+calculator is fine, though, since many of the exams MMLU draws from allow calculators.
I think it’s better if calculator-assisted runs count, given the ultimate purpose of the benchmark. We can’t ban AI models from using symbolic logic as an alignment strategy.
The purpose of this is to test and forecast problem-solving ability, using examples that lose much of their informativeness in the presence of executable Python scripts. I don’t think this restriction is an ideological statement about what sort of alignment strategies we want.