[Edit: this comment is probably retracted, although I’m still confused; see discussion below.]
I’d like clarification from Paul and Eliezer on how the bet would resolve, if it were about whether an AI could get IMO silver by 2024.
Besides not fitting in the time constraints (which I think is kind of a cop-out because the process seems pretty parallelizable), I think the main reason that such a bet would resolve no is that problems 1, 2, and 6 had the form “find the right answer and prove it right”, whereas the DeepMind AI was given the right answer and merely had to prove it right. Often, finding the right answer is a decent part of the challenge of solving an Olympiad problem. Quoting more extensively from Manifold commenter Balasar:
The “translations” to Lean do some pretty substantial work on behalf of the model. For example, in the theorem for problem 6, the Lean translation that the model is asked to prove includes an answer that was not given in the original IMO problem.
theorem imo_2024_p6 (IsAquaesulian : (ℚ → ℚ) → Prop)
    (IsAquaesulian_def : ∀ f, IsAquaesulian f ↔
      ∀ x y, f (x + f y) = f x + y ∨ f (f x + y) = x + f y) :
    IsLeast {(c : ℤ) | ∀ f, IsAquaesulian f →
      {(f r + f (-r)) | (r : ℚ)}.Finite ∧
      {(f r + f (-r)) | (r : ℚ)}.ncard ≤ c} 2
The model is supposed to prove that “there exists an integer c such that for any aquaesulian function f there are at most c different rational numbers of the form f(r)+f(−r) for some rational number r, and find the smallest possible value of c”.
The original IMO problem does not include that the smallest possible value of c is 2, but the theorem that AlphaProof was given to solve has the number 2 right there in the theorem statement. Part of the problem is to figure out what 2 is.
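For concreteness, here is a sketch (my own illustration, assuming current Mathlib conventions; it is not anything published by DeepMind) of what the same statement looks like if the answer is left open rather than baked in: the prover has to exhibit the constant as well as prove that it works.

import Mathlib

-- Illustration only (not AlphaProof's actual input): the answer is an
-- existential witness instead of the literal 2, so finding it is part of
-- the problem. A least element is unique (Mathlib's `IsLeast.unique`),
-- so only one choice of `answer` can make this true.
theorem imo_2024_p6_open (IsAquaesulian : (ℚ → ℚ) → Prop)
    (IsAquaesulian_def : ∀ f, IsAquaesulian f ↔
      ∀ x y, f (x + f y) = f x + y ∨ f (f x + y) = x + f y) :
    ∃ answer : ℤ, IsLeast {(c : ℤ) | ∀ f, IsAquaesulian f →
      {(f r + f (-r)) | (r : ℚ)}.Finite ∧
      {(f r + f (-r)) | (r : ℚ)}.ncard ≤ c} answer := by
  sorry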
I’m pretty sure what’s going on is:
The system automatically generates candidate theorems it might try to prove, expressing possible answers, and attempts to prove them.
In this case, the version of the theorem it ended up being able to prove was the one with 2 in that position. (Which is just as well, since—I assume, not having actually tried to solve the problem for myself—that is in fact the unique number for which such a theorem is true.)
So the thing you end up getting a proof of includes the answer, but not because the system was told the answer in advance.
It would be nice to have this more explicitly from the AlphaProof people, though.
[EDITED to add:] Actually, as per the tweet from W T Gowers quoted by “O O” elsewhere in this thread, we do have it explicitly, not from the AlphaProof people but from one of the mathematicians the AlphaProof people engaged to evaluate their solutions.
It wasn’t told what to prove. To get round that difficulty, it generated several hundred guesses (many of which were equivalent to each other). Then it ruled out lots of them by finding simple counterexamples, before ending up with a small shortlist that it then worked on.
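As a toy sketch of that loop (my own, purely illustrative; AlphaProof's actual machinery is not public), the search amounts to: enumerate candidate constants, discard any that a cheap counterexample refutes, and spend real proof effort only on the survivors.

-- Toy illustration of "guess, refute by counterexample, shortlist, prove".
-- The refuter below is a placeholder predicate, not real mathematics; in the
-- real system a guess would be ruled out by finding a simple counterexample.
def guesses : List Int := [0, 1, 2, 3, 4, 5]

def refuted (c : Int) : Bool :=
  decide (c < 2)

-- Candidates that survive cheap refutation form the shortlist that gets
-- full proof attempts.
def shortlist : List Int :=
  guesses.filter (fun c => !(refuted c))

#eval shortlist  -- [2, 3, 4, 5]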
https://x.com/wtgowers/status/1816839783034843630
That comment doesn’t seem to be correct.
But, in this tweet https://x.com/wtgowers/status/1816509813440082014 he says:
Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.
So maybe e.g. the (not very auto-) autoformalization part produced a theorem-statement template with some sort of placeholder where the relevant constant value goes, and AlphaProof knew it needed to find a suitable value to put in the gap.
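If that guess is right, the shape of such a template might look something like this (again my own hedged sketch, assuming Mathlib's IsLeast; nothing here comes from the AlphaProof release): the statement is parametrised over the answer slot, and solving it includes supplying the value.

import Mathlib

-- Illustration only: the formalized statement as a template, with the
-- constant to be discovered left as a parameter instead of the literal 2.
def p6Template (IsAquaesulian : (ℚ → ℚ) → Prop) (answer : ℤ) : Prop :=
  IsLeast {(c : ℤ) | ∀ f, IsAquaesulian f →
    {(f r + f (-r)) | (r : ℚ)}.Finite ∧
    {(f r + f (-r)) | (r : ℚ)}.ncard ≤ c} answer

-- "Filling in the gap" then means finding some value of `answer` for which
-- the instantiated statement is provable; here that value happens to be 2.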
This part seems to be just about not letting an LLM translation get the problem slightly wrong and mess up the score as a result.
It would be a shame for your once-a-year attempt to have even a 2% chance of being messed up by an LLM hallucination.
I think it would still be correct for it to resolve no for time reasons, even if it feels like it could be parallelisable (if it were parallelised, I would feel differently).