Google DeepMind reports on a system for solving mathematical problems that is allegedly able to give complete solutions to four of the six problems on the 2024 IMO, putting it near the top of the silver-medal category.
Well, actually, two systems for solving mathematical problems: AlphaProof, which is more general-purpose, and AlphaGeometry, which is specifically for geometry problems. (This is AlphaGeometry 2; they reported earlier this year on a previous version of AlphaGeometry.)
AlphaProof works in the “obvious” way: an LLM generates candidate next steps which are checked using a formal proof-checking system, in this case Lean. One not-so-obvious thing, though: “The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.”
[EDITED to add:] Or maybe it doesn’t work in the “obvious” way. As cubefox points out in the comments below, DeepMind’s description doesn’t explicitly say that it does that, and they’re infuriatingly vague about how “AlphaProof generates solution candidates”. The previous paragraph at least insinuates that it does it with an LLM, but it’s very unclear.
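For concreteness, here is roughly what that "obvious" generate-and-check loop would look like if it does work that way. This is purely my own sketch; every function name below is a hypothetical placeholder, not anything DeepMind has described.

```python
# Hypothetical sketch of the "obvious" generate-and-check loop described above.
# None of these functions correspond to anything DeepMind has published;
# propose_steps(), accepts(), generate_variations() etc. are placeholders.

def attempt_proof(statement, policy, checker, max_depth=64):
    """Try to extend a partial proof one verified step at a time."""
    proof = []
    for _ in range(max_depth):
        for step in policy.propose_steps(statement, proof):   # e.g. candidate Lean tactics
            if checker.accepts(statement, proof + [step]):     # formal verifier says the step is legal
                proof.append(step)
                break
        else:
            return None  # no proposed step was accepted; give up on this attempt
        if checker.is_complete(statement, proof):              # no goals left: a full proof
            return proof
    return None

def training_loop(problems, policy, checker, rounds=10):
    """Reinforce the proposer on its own verified successes, including on
    self-generated variations of the target problems (per the quoted sentence)."""
    for _ in range(rounds):
        for problem in problems:
            for variant in policy.generate_variations(problem):  # self-generated variants
                proof = attempt_proof(variant, policy, checker)
                if proof is not None:
                    policy.reinforce(variant, proof)  # verified proofs become training data
```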
(That last bit about applying the training loop during the contest is reminiscent of something from the world of computer go: a couple of years ago someone trained a custom version of KataGo specifically to solve the infamous Igo Hatsuyoron problem 120, starting with ordinary KataGo and feeding it training data containing positions reachable from the problem’s starting position. They claim to have laid that problem to rest at last.)
AlphaGeometry is similar but uses something specialized for (I think) Euclidean planar geometry problems in place of Lean. The previous version of AlphaGeometry allegedly already performed at gold-medal IMO standard; they don’t say anything about whether that version was already able to solve the 2024 IMO problem that was solved using AlphaGeometry 2.
AlphaProof was able to solve questions 1, 2, and 6 on this year’s IMO (two algebra, one number theory). It produces Lean-formalized proofs. AlphaGeometry 2 was able to solve question 4 (plane geometry). It produces proofs in its own notation.
The solutions found by the Alpha… systems are at https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/index.html. (There are links in the top-of-page navbar to solutions to the individual problems.)
(If you’re curious about the IMO questions or want to try them yourself before looking at the machine-generated proofs, you can find them—and those for previous years—at https://www.imo-official.org/problems.aspx.)
One caveat (note: an earlier version of what I wrote failed to notice this and quite wrongly explicitly claimed something different): “First, the problems were manually translated into formal mathematical language for our systems to understand.” It feels to me like it shouldn’t be so hard to teach an LLM to convert IMO problems into Lean or whatever, but apparently they aren’t doing that yet. [EDITED to add:] But it seems they are doing this to generate training data. Perhaps the fact that they didn’t do it for the IMO problem statements themselves reveals something about the limitations of their English-to-Lean translator?
Another caveat: “Our systems solved one problem within minutes and took up to three days to solve the others.” Later on they say that AlphaGeometry 2 solved the geometry question within 19 seconds, so I guess that was also the one that was done “within minutes”. Three days is a lot longer than human IMO contestants get given, but this feels to me like the sort of thing that will predictably improve pretty rapidly.
In early 2022 Paul Christiano and Eliezer Yudkowsky publicly stated a bet: Paul put an AI IMO gold medal by 2025 at 8%, and Eliezer at >16%. Paul said “If Eliezer wins, he gets 1 bit of epistemic credit.” I’m not sure how to do the exact calculation, but based on the current market price of 69%, Eliezer is probably expected to win over half a bit of epistemic credit.
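(For a rough sense of scale, under the simplest reading I can think of, where the market price is the probability the bet resolves in Eliezer's favour and the 1 bit is the whole stake:

$$
\mathbb{E}[\text{credit}] \approx 0.69 \times 1 \text{ bit} = 0.69 \text{ bits} > 0.5 \text{ bits}.
$$

This ignores the losing case, which is presumably part of why the exact calculation is less clear.)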
If the news holds up we should update partway to the takeaways Paul virtuously laid out in the post, to the extent we haven’t already.
[Edit: this comment is probably retracted, although I’m still confused; see discussion below.]
I’d like clarification from Paul and Eliezer on how the bet would resolve, if it were about whether an AI could get IMO silver by 2024.
Besides not fitting in the time constraints (which I think is kind of a cop-out because the process seems pretty parallelizable), I think the main reason that such a bet would resolve no is that problems 1, 2, and 6 had the form “find the right answer and prove it right”, whereas the DeepMind AI was given the right answer and merely had to prove it right. Often, finding the right answer is a decent part of the challenge of solving an Olympiad problem. Quoting more extensively from Manifold commenter Balasar:
I’m pretty sure what’s going on is:
- The system automatically generates candidate theorems it might try to prove, expressing possible answers, and attempts to prove them.
- In this case, the version of the theorem it ended up being able to prove was the one with 2 in that position. (Which is just as well, since—I assume, not having actually tried to solve the problem for myself—that is in fact the unique number for which such a theorem is true.)
- So the thing you end up getting a proof of includes the answer, but not because the system was told the answer in advance.
It would be nice to have this more explicitly from the AlphaProof people, though.
[EDITED to add:] Actually, as per the tweet from W T Gowers quoted by “O O” elsewhere in this thread, we do have it explicitly, not from the AlphaProof people but from one of the mathematicians the AlphaProof people engaged to evaluate their solutions.
https://x.com/wtgowers/status/1816839783034843630
That comment doesn’t seem to be correct.
But, in this tweet https://x.com/wtgowers/status/1816509813440082014 he says:
So maybe e.g. the (not very auto-) autoformalization part produced a theorem-statement template with some sort of placeholder where the relevant constant value goes, and AlphaProof knew it needed to find a suitable value to put in the gap.
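To illustrate what such a template might look like (this is purely my own toy example, not AlphaProof's actual formalization of any IMO problem): the answer appears as a parameter of the statement, and the search has to supply a concrete value for which the resulting theorem is provable.

```lean
import Mathlib

-- Toy illustration only, not AlphaProof's actual formalization of anything.

-- A theorem-statement "template": the answer is a parameter, i.e. a blank to fill in.
def templateClaim (answer : ℕ) : Prop :=
  ∀ n : ℕ, answer ∣ 2 * n

-- Part of the search's job would be to find a value for the blank that makes the
-- claim provable; here 2 works (while e.g. 3 would make the claim false).
theorem template_with_two : templateClaim 2 := by
  unfold templateClaim
  intro n
  exact dvd_mul_right 2 n
```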
This part seems to just be about not letting an LLM translation get the problem slightly wrong and mess up the score as a result.
It would be a shame for your once-a-year attempt to have even a 2% chance of being messed up by an LLM hallucination.
I think it would still be correct for it to resolve no for time reasons, even if it feels like it could be parallelisable (if it were parallelised, I would feel differently).
Early 2023 I bet $500 on AI winning the IMO gold medal by 2026. This was a 1:1 bet against Michael Vassar, meaning I attributed >50% to this. It now seems very likely that I’m going to win.
To me, this was to be expected as a straightforward application of AlphaZero-like self-play amplification and distillation. The missing piece was the analogous policy network, which for the AlphaZero board games was a convolutional neural network. Once it became quite clear that existing LLMs were smart enough to generate good heuristics for this (with enough data), it seemed quite obvious to me that self-play guided by an LLM-policy heuristic would work.
I don’t think this is true, actually. Look at the diagram of AlphaProof’s reinforcement learning training loop. The proof part (“solver network”) seems to be purely RL based; it even uses the same algorithm as AlphaZero. (The article actually describes AlphaProof as “a new reinforcement-learning based system”.)
The contribution of the LLM to AlphaProof seems to be only in providing the left part of the diagram (the “formalizer network”). Namely translating, and somehow expanding, one million human written proofs into 100 million formal Lean proofs. I’m not sure how the LLM increases the number of proofs by 100x, but I assume for each successfully formalized human-written proof it also generates 100 (simple?) synthetic variations. This interpretation is also corroborated by this statement:
So AlphaProof is actually a rather narrow RL system, not unlike the original AlphaGo. The latter was also bootstrapped on human data (expert Go games), similar to how AlphaProof uses (formalized) human proofs for initial training. (Unlike AlphaGo and AlphaProof, AlphaGo Zero and the original AlphaZero did not rely on any non-synthetic / human training data.)
So, if LLMs like GPT-4 and Gemini are considered to be fairly general AI systems (some even consider them AGI), while systems like AlphaGo or AlphaZero are regarded as narrow AI, then AlphaProof counts as narrow AI. This contradicts views like the one from Dwarkesh Patel, who thought solving IMO math problems might be AGI-complete, and confirms the opinion of people like Grant Sanderson (3Blue1Brown), who thought that training on and solving formal math proofs is formally very similar to AlphaGo solving board games.
That being said, unlike AlphaProof, AlphaGeometry 2, which solved the one geometry problem, is not RL-based. It does indeed use an LLM (a Gemini-like model trained on large amounts of synthetic training data) when coming up with proofs, though it works in tandem with a non-ML symbolic deduction engine. Such engines are largely based on brute-force search, similar to how Deep Blue played chess. So AlphaGeometry is simultaneously more general (LLM) and less general (symbolic deduction engine) than AlphaProof (an RL system).
We obviously should wait for the paper and more details, but I am certain this is incorrect. Both your quote and the diagram are clear that it is one million problems, not proofs.
Well, what kind of problem do you think will help with learning how to prove things, if not proofs? AlphaGo & co learned to play games by being trained with games. And AlphaProof uses the AlphaZero algorithm.
The plain interpretation is that only statements to be proved (or disproved) were sourced from human data, without any actual proof steps. In the Go analogy, it is like being given Go board positions without the next moves.
It makes a lot of sense that this is needed and helpful, because winning a game of Go from the empty board is a different and easier problem than playing the best moves from arbitrary Go positions. The Igo Hatsuyoron problem mentioned in the original post is a good example; additional training was needed because such positions never come up in actual games.
Imagine AlphaZero trained from randomly sampled Go positions, each intersection being black/white/empty with uniform probability. It would play a much worse game of Go. Fortunately, how to sample “relevant” Go positions is an easy problem: you just play the game, with the initial N moves sampled at a higher temperature for diversity.
In comparison, how to sample relevant math positions is unclear. Being good at finding proofs in arbitrary formal systems from an arbitrary set of axioms is actually quite different from being good at math. Using human data sidesteps this problem.
I don’t understand what you mean here. How would merely seeing conjectures help with proving them? You arguably need to see many example proofs of example conjectures. Otherwise it would be like expecting a child who has never seen a proof to learn to prove conjectures merely by showing it a lot of conjectures.
The feedback is from Lean, which can validate attempted formal proofs.
Hmm, maybe you’re right. I thought I’d seen something that said it did that, but perhaps I hallucinated it. (What they’ve written isn’t specific enough to make it clear that it doesn’t do that either, at least to me. They say “AlphaProof generates solution candidates”, but nothing about how it generates them. I get the impression that it’s something at least kinda LLM-like, but could be wrong.)
The diagram actually says it uses the AlphaZero algorithm. Which obviously doesn’t involve an LLM.
The AlphaZero algorithm doesn’t obviously not involve an LLM. It has a “policy network” to propose moves, and I don’t know what that looks like in the case of AlphaProof. If I had to guess blindly I would guess it’s an LLM, but maybe they’ve got something else instead.
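Whatever ends up proposing the candidate steps, the verification side is at least concrete: Lean either accepts a proof script or it doesn't, which is the kind of clean pass/fail signal this sort of training needs. As a minimal sketch (my own, assuming a local Lean 4 project managed with the standard `lake` tool; none of this describes DeepMind's actual setup):

```python
# Sketch of using Lean as a binary pass/fail signal, assuming a local Lean 4
# project set up with `lake` (e.g. one depending on Mathlib). The 0/1 "reward"
# framing is my own illustration, not a description of DeepMind's system.
import subprocess
import tempfile
from pathlib import Path

def lean_accepts(theorem_and_proof: str, project_dir: str, timeout_s: int = 300) -> bool:
    """Write a candidate proof to a .lean file and ask Lean to check it.

    Returns True iff Lean compiles the file without errors.
    """
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".lean", dir=project_dir, delete=False
    ) as f:
        f.write(theorem_and_proof)
        candidate = Path(f.name)
    try:
        result = subprocess.run(
            ["lake", "env", "lean", str(candidate)],  # run Lean inside the project's environment
            cwd=project_dir,
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0  # exit code 0 means the proof checked
    except subprocess.TimeoutExpired:
        return False  # treat a timeout as a failed attempt
    finally:
        candidate.unlink(missing_ok=True)

# Usage: reward = 1.0 if lean_accepts(candidate_proof, "/path/to/lean/project") else 0.0
```

One caveat with this naive check: a “proof” containing `sorry` still compiles with only a warning, so a real harness would also have to reject those.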
Honestly, I think people are overestimating this. Some quick thoughts:
They already claimed once to be at a 1200 Elo level in competitive programming on Codeforces, but in real competition settings it only reached a level of, if I remember correctly, around 500, as people found the corresponding account they used for testing. Also, their solution to one very hard problem was exactly the same as one from a human competitor, and they likely had this exact problem already in their training data, so their training data was likely contaminated.
I’m not really surprised that this type of algorithm will perform well on ad-hoc & constructive type problems. Generally, in a lot of cases, you break the problem down into small cases, find a pattern (mathematical model), and generalize from it. These kinds of math olympiad problems are quite similar in nature. So current models have a very good sense of what a “pattern” is and, of course, have many more FLOPS to test their hypotheses on small cases. Personally, I find solving ad-hoc problems pretty boring, because mostly you’re taking some patterns as hypotheses and just brute-forcing your way through to see what works and what doesn’t → machines can do that faster.
So I’m not really sure they can solve their definition of “creativity” by adding more and more compute, since all the Alpha-style algorithms (RL in AlphaZero, AlphaGo, etc.) are currently trying to make those heuristics as good as possible while trying to reduce branching in order to save compute. I don’t know how far you can go by throwing more and more compute at it and brute-forcing with certain heuristics until a solution to the problem is found.
Summing up: these are methods similar to those used by researchers decades ago, but we have a lot more FLOPS now.
The LLM’s only use was to translate problems to a formal language, in this case Lean, in order to generate training data for their RL algorithm used in all of the Alpha-type algos.
That’s also why I think they started with these types of problems: you can approximate the compute needed. The generative part involved in geometry, ad-hoc problems, etc., is much smaller than in algebra, combinatorics, etc., since it’s much more mechanical in nature. You will be able to solve a lot of these types of problems just by scaling compute.
AlphaProof seems to be augmented with an LLM trained on symbolic chains, going from statement → examples → conjecture → proof/disproof of the conjecture, maybe even with multiple conjecture nodes. So, on the other hand, something working within natural language, which can come up with conjectures from a next-token-prediction perspective, could be at work.
The system could comprise a “generator” model creating exhaustive examples from the statement, an “observer” model studying these examples to find and intersect their properties, and AlphaProof as the “prover” to prove or disprove the conjectures. Whether to use LLMs or traditional models is up for experimentation, but each component must be formally verified.
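To make the shape of that idea concrete, here is a rough sketch of such a pipeline. All of the component interfaces are hypothetical; nothing here corresponds to a published system.

```python
# Hypothetical sketch of the "generator" / "observer" / "prover" pipeline described
# above. None of these components or interfaces correspond to a published system;
# they are placeholders showing how the pieces would fit together.
from dataclasses import dataclass

@dataclass
class Conjecture:
    statement: str        # a candidate formal statement, e.g. written in Lean syntax
    evidence: list[str]   # the worked-out small examples that suggested it

def solve(problem_statement: str, generator, observer, prover) -> list[tuple[Conjecture, str]]:
    """Generator enumerates examples, observer distills candidate conjectures from
    their shared properties, and the prover (an AlphaProof-like system) settles each one."""
    examples = generator.enumerate_examples(problem_statement)        # exhaustive small cases
    conjectures = observer.propose_conjectures(problem_statement, examples)
    results = []
    for conjecture in conjectures:
        verdict = prover.prove_or_disprove(conjecture.statement)      # formally verified outcome
        results.append((conjecture, verdict))
    return results
```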
I think you can see that at work in the solution of problem 2, which is correct but feels very unintuitive, and there are much more elegant solutions to this problem. (I specifically only looked at this one for now; maybe I will update this in an edit for the other problems.)
I saw someone claiming the opposite:
The methods used to solve these “new” problems are, as I’ve already stated, highly amenable to brute-force approaches. It’s more of a computation problem (see [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)). I’m again not surprised these kinds of problems got solved.
While these problems may be new, they employ very similar methods to those definitely used in previous competitive programming and Math Olympiad problems. I don’t think the author has really looked into the specifics of how AlphaGeometry and AlphaProof have come up with these solutions. It’s honestly disappointing to see that they were able to mislead such people (if he truly hasn’t looked into the specifics). It seems more like he wants to use his status to push a particular narrative.
I would bet a lot that this system will fail on almost any combinatorics problem at the moment...
Since this is a point that seemingly causes a lot of confusion and misunderstanding, I’ll try to find some time to write down my reasoning and thoughts in a more exhaustive post.
I don’t know if you guessed this prior to reading the post, but if so, good guess:
I’ve been wondering about this ever since I saw that sentence, so now I’m curious to see your post explaining your reasoning.
ETA: I just saw sunwillrise’s great comment on this and am now wondering how your reasoning compares
I’d be interested in reading more about this. Could you provide a link?
To the contrary, this used to be very hard. Of course, an LLM can learn to translate “real number” to “ℝ”. But that’s only possible because ℝ is formalized in Lean/Mathlib! Formalizing the real numbers is a research-level problem, one which historically occupied much of 19th-century mathematics.
Recently I came across a paper, “Codification, Technology Absorption, and the Globalization of the Industrial Revolution,” which discusses the role of translation and dictionaries in the industrialization of Japan. The following quote is illustrative.
Just as, say, translating The Wealth of Nations into Japanese was of entirely different difficulty in the 19th century versus the 20th century (the 19th-century Japanese started by debating how to translate “society”), formalizing IMO problems in Lean is only workable thanks to Mathlib. It would not be workable in other formal systems lacking a similarly developed math library, and formalizing research mathematics in Lean is similarly unworkable at the moment, until Mathlib is further developed to cover the necessary definitions and background theorems. In the past, ambitious formalization projects usually spent half their time formalizing the definitions and background results they needed.
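To make the point concrete with a trivial example of my own: even stating something as simple as “every real number has a nonnegative square” in Lean leans entirely on Mathlib already having formalized the reals and their ordered-ring structure; in a bare Lean 4 environment the notation `ℝ` does not even exist.

```lean
import Mathlib  -- without Mathlib, the notation `ℝ` is simply not defined

-- Stating and proving even this one-liner is easy only because the hard work of
-- formalizing the reals and their ordered-ring structure already lives in Mathlib.
example (x : ℝ) : 0 ≤ x ^ 2 := sq_nonneg x
```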
Anyone have a good intuition for why Combinatorics is harder than Algebra, and/or why Algebra is harder than Geometry? (For AIs). Why is it different than for humans?
I’m not sure it is different than for humans, honestly. First, I should give a standard disclaimer that different students have different strengths and weaknesses in terms of mathematical problem-solving ability, as well as different aesthetic preferences for what types of problems they like to work on, so any overview like the one I am about to give is necessarily reductive and doesn’t capture the full range of opinions on this matter.
As I recall from my own Math olympiad days (and, admittedly, it has been quite a while), Combinatorics problems were generally considered to be the most difficult/tricky/annoying for the majority of contestants. This was because, unlike the vast majority of other fields, Combinatorics problems required a certain set of intuitions that could be described as “combinatorial taste”, and solving them required you to understand the particular problem in front of you deeply, on an S1 level. You generally had to “play around” with the set-up for a fair bit, developing a better and better picture of what’s going on, before certain insights just clicked in your head and guided you on the path to a solution.[1]
Geometry, by contrast, was much more straightforward.[2] Not in the sense that the problems were necessarily easier from an “objective” standpoint,[3] but because you had an essentially fixed set of techniques and configurations you needed to understand (and mostly memorize) that would solve the vast majority of problems. As Evan Chen has said:
You also often had the opportunity to turn a geometry problem into an algebra bash, through complex numbers, barycentric coordinates, Cartesian coordinates, trigonometry, etc. These were skills you developed naturally by practice, and in many cases, if you were an experienced and well-prepared contestant, whenever you were given a Geometry problem, you would be able to instantly remember related problems and ideas and figure out the rough plan of how you needed to go about solving the problem. This is completely the opposite of Combinatorics, where you did indeed sometimes have recurring patterns or broad themes,[4] but you mostly had to take each (worthwhile, IMO-level) problem on its own terms, instead of falling back on previously-memorized ideas.[5]
Rather interestingly, this phenomenon (the distinction between the aesthetic of Combinatorics being centered on “deeply understanding and solving one problem” and the rest of the fields as “applying general techniques that expand the understanding of the entirety of Geometry, Algebra, etc.”) was also noted at the research level by Timothy Gowers in his excellent essay on “The Two Cultures of Mathematics.”[6]
So, from this perspective, I suppose it would make a lot of sense for a rather narrow AI system to be better at areas of olympiad Math that are more easily “systematizable” (such as Algebra and Geometry) and worse at the areas that seem to require a sort of deep understanding and problem-solving (like Combinatorics) that could plausibly get it closer to general intelligence.
I think the standard example to illustrate this is problem 2 of the 2011 IMO (nicknamed the “windmill problem”), which had an average score of 1.5/7, roughly half of what the organizers generally aim for (~3/7, as in 2012), mostly because they underestimated how difficult it would be for the contestants to understand the (combinatorial) set-up properly given the time constraints.
And so was Algebra, but to a lesser extent (although contestants working on inequalities, for example, also benefitted from a boatload of books filled to the brim with dozens of techniques and proof methods specifically geared for this).
I’m not sure how such a thing could even be defined, anyway.
If there actually was no structure to Combinatorics problems, then human minds would not be able to predictably perform better at solving them over time as they practiced more and more, which is false.
To be fair, I have heard through the grapevine that this is getting less and less true as the years pass, because trainers are getting better and better at identifying and compiling books/documents, etc., for training specific combinatorial problem-solving techniques.
Of course, there is a lot of nuance here that I am not getting into, and you shouldn’t come away with the impression that Combinatorics is all just one-time tricks and the rest of math is just memorizing proofs that worked on previous problems. Reality is far more complex and multi-dimensional than that; there are parts of Combinatorics that work on the basis of deep theories (such as when using generating functions or the probabilistic method or when making connections with Statistical Mechanics, etc.), and also non-Combinatorial results that seem to have arisen from an almost purely individualized assessment of the problem (David Corfield mentions arithmetic progressions among primes as an example).
Is there a paper/technical report?
[ETA: No, no paper at this point]
I searched for it and found none. The Twitter conversation also seems to imply that no paper or technical report is out yet.
No, but these tweets describe the basic concept: extension of AlphaZero to train on theorems in Lean automatically converted from natural-language proofs.
This looks pretty relevant to https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer and the related manifold bet: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat
The market is surprised.
I don’t understand why people aren’t freaking out from this news. Waiting for the paper I guess.
I’m guessing many people assumed an IMO solver would be AGI. However, this is actually a narrow math solver. But it’s probably useful on the road to AGI nonetheless.
We don’t know how narrow it is yet. If they did for algebra and number theory something like what they did for geometry in AlphaGeometry (v1), providing it with a well-chosen set of operations, then I’ll be more inclined to agree.
People generally expected math AI to progress pretty fast already. I was angry about machine-assisted math being a neglected area before this and my anger levels aren’t substantially increased by the news.
The news is not very old yet. Lots of potential for people to start freaking out.
It’s funny to me that the one part of the problem the AI cannot solve is translating the problem statements to Lean. I guess it’s the only part that the computer has no way to check.
Does anyone know if “translating the problem statements” includes providing the solution (e.g. “an even integer” for P1), with the AI just needing to prove the solution correct? It’s not clear to me what’s human-written and what’s AI-written, and the solution is part of the “theorem” part, which I’d guess is human-written.
Answer: it was not given the solution. https://x.com/wtgowers/status/1816839783034843630?s=46&t=UlLg1ou4o7odVYEppVUWoQ