Since your forecast’s closing date, MATH accuracy has reached 84.3% according to this paper, if results using the GPT-4 Code Interpreter count: https://arxiv.org/abs/2308.07921v1
Usage of calculators and scripts is disqualifying on many competitive maths exams, so results obtained this way wouldn’t count (this was specified some years back). That said, it’s an interesting paper worth checking out.
Is it clear these results don’t count? I see nothing in the Metaculus question text that rules it out.
It was specified at the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113

In your Metaculus question you may not have added that restriction. I think the question is much less interesting and informative without it. The questions were designed assuming no calculator access, and it’s well known that many AIME problems are dramatically easier with a powerful calculator, since one could bash all 1,000 candidate answers and find the number that works. That no longer tests problem-solving ability; it tests the ability to set up a simple script, so it loses nearly all the signal.

Separately, the human results we collected were gathered under a no-calculator restriction, and the AMC/AIME exams themselves forbid calculators. There are other maths competitions that allow calculators, but there are substantially fewer quality questions of that sort.
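To make the “bash 1,000 options” point concrete, here is a minimal sketch. AIME answers are always integers from 000 to 999, so a script can simply test every candidate. The example problem below is a hypothetical AIME-style question chosen for illustration, not one from an actual exam:

```python
# Brute-force sketch of the "calculator bash" described above.
# Hypothetical AIME-style problem: find the smallest positive integer n
# whose square ends in the digits 444.
# An AIME answer is an integer in [0, 999], so exhaustive search is
# trivial for a script even when the intended solution needs real insight.

def brute_force():
    for n in range(1, 10_000):
        if n * n % 1000 == 444:
            return n
    return None

print(brute_force())  # 38, since 38**2 == 1444
```

A solver with script access reduces such a problem to a few lines of search, which is exactly why calculator-assisted scores measure something different from human problem-solving.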
I think MMLU+calculator is fine though since many of the exams from which MMLU draws allow calculators.
I think it’s better if calculator-assisted results are counted, given the ultimate purpose of the benchmark. We can’t ban AI models from using symbolic logic as an alignment strategy.
The purpose of this is to test and forecast problem-solving ability, using examples that lose much of their informativeness when executable Python scripts are available. This restriction isn’t an ideological statement about which alignment strategies we want.
That doesn’t make much sense to me. Here the calculator is running on the same hardware as the model, and in theory a transformer could simply contain a sub-model that emulates a calculator.
I think there’s a clear enough distinction between transformers with and without tools. The human brain can also be viewed as a computational machine, but when exams say “no calculators,” they aren’t banning mental calculation, just specific external tools.