Honestly, I think people are overestimating this. Some quick thoughts:
They already claimed once to be at a 1200 Elo level in competitive programming on Codeforces, but in real competition settings it only reached, if I remember correctly, a level of around 500 once people found the account they had used for testing. Also, their solution to one very hard problem was exactly the same as a human competitor’s, and they likely had that exact problem in their training data already, so their training data was likely contaminated.
I’m not really surprised that this type of algorithm performs well on ad-hoc and constructive problems. Generally, in a lot of cases, you break the problem into small cases, find a pattern (a mathematical model), and generalize from it. These kinds of math olympiad problems are quite similar in nature. So current models have a very good sense of what a “pattern” is and, of course, far more FLOPS to check their hypotheses on small cases. Personally, I find solving ad-hoc problems pretty boring, because mostly you’re taking some patterns as hypotheses and brute-forcing your way through to see what works and what doesn’t → machines can do that faster.
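To make the small-cases → pattern → conjecture loop concrete, here is a toy Python sketch of my own (purely illustrative, nothing to do with DeepMind’s actual pipeline): brute-force a small combinatorial count for a few values of n, then read the pattern off the numbers.

```python
from itertools import combinations

def count_no_adjacent_subsets(n: int) -> int:
    """Count subsets of {1, ..., n} that contain no two consecutive numbers."""
    total = 0
    for k in range(n + 1):
        for subset in combinations(range(1, n + 1), k):
            if all(b - a > 1 for a, b in zip(subset, subset[1:])):
                total += 1
    return total

# Brute-force the small cases, exactly as you would on paper, just faster.
print([count_no_adjacent_subsets(n) for n in range(1, 9)])
# [2, 3, 5, 8, 13, 21, 34, 55]  -> looks like Fibonacci

# The "pattern" stage: conjecture that the count for n is the (n+2)-th
# Fibonacci number, then try to prove (or refute) that in general.
```

A machine can run this loop over far more candidate patterns per second than any human, which is exactly why this part of the work feels mechanical to me.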
So I’m not really sure they can get to their definition of “creativity” just by adding more and more compute, since all the Alpha-style algorithms (the RL in AlphaZero, AlphaGo, etc.) are essentially trying to learn heuristics that are as good as possible at reducing branching, in order to save compute. I don’t know how far you can go by throwing ever more compute at it and brute-forcing with certain heuristics until a solution to the problem is found.
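For what I mean by “heuristics that reduce branching”, here is a minimal, generic sketch (a plain best-first search, not DeepMind’s actual algorithm; `score` stands in for whatever learned value/policy function guides the search): the heuristic decides which branches get expanded at all, so a better heuristic directly means less compute spent.

```python
import heapq
from typing import Callable, Hashable, Iterable, Optional

def best_first_search(
    start: Hashable,
    expand: Callable[[Hashable], Iterable[Hashable]],  # candidate next states (e.g. proof steps)
    is_goal: Callable[[Hashable], bool],                # e.g. "the proof is complete"
    score: Callable[[Hashable], float],                 # learned heuristic; lower = more promising
    budget: int = 10_000,                               # compute budget in node expansions
) -> Optional[Hashable]:
    counter = 0                                         # tie-breaker so states are never compared
    frontier = [(score(start), counter, start)]
    seen = {start}
    while frontier and budget > 0:
        _, _, state = heapq.heappop(frontier)
        budget -= 1
        if is_goal(state):
            return state
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(frontier, (score(nxt), counter, nxt))
    return None  # budget exhausted: the heuristic wasn't good enough to prune the tree
```

With a perfect score function you expand almost nothing; with a useless one you are back to exhaustive brute force, and the only lever left is more compute.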
Summing up: these are similar to methods researchers were already using decades ago, but we have a lot more FLOPS now.
The LLM’s only use was to translate problems into a formal language, in this case Lean, in order to generate training data for the RL algorithm used in all of the Alpha-type algos.
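For intuition, this is roughly the kind of artifact such a translation step produces: a formal statement that the prover then works on. The example below is a hand-written toy of my own (the theorem name and the statement are made up, not anything from their training data).

```lean
-- Toy illustration of autoformalization (my own example, not AlphaProof output).
-- Natural-language input: "For every natural number n, n * (n + 1) is even."
theorem toy_even (n : Nat) : 2 ∣ n * (n + 1) := by
  sorry -- the prover's job is to fill this in; the LLM only produced the statement
```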
That’s also why I think they started with these types of problems: you can approximate the compute needed. The generative part involved in geometry, ad-hoc problems, etc. is much smaller than in algebra, combinatorics, etc., since the former are much more mechanical in nature. You will be able to solve a lot of these types of problems just by scaling compute.
AlphaProof seems like it could be augmented with an LLM trained on symbolic chains of the form statement → examples → conjecture → proof/disproof of the conjecture, maybe even with multiple conjecture nodes. On the other hand, the same idea within natural language, where a model comes up with conjectures from a next-token perspective, could also work.
The system could comprise a “generator” model creating exhaustive examples from the statement, an “observer” model studying these examples to find and intersect their properties, and AlphaProof as the “prover” to prove or disprove conjectures. Whether using LLMs or traditional models is up for experimentation, but each component must be formally verified.
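As a very rough sketch of that generator → observer → prover split (the split itself is from the paragraph above; the Python interfaces, names, and signatures here are my own invention, not DeepMind’s):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Conjecture:
    statement: str                 # e.g. a candidate formal (Lean) statement
    proved: Optional[bool] = None  # True = proved, False = disproved, None = open

def generator(problem: str, n_examples: int) -> list[dict]:
    """Enumerate concrete instances/examples of the problem statement."""
    raise NotImplementedError  # e.g. plug in small cases, random instances, ...

def observer(examples: list[dict]) -> list[Conjecture]:
    """Find properties shared by all the examples and emit them as conjectures."""
    raise NotImplementedError  # e.g. intersect the observed invariants

def prover(conjecture: Conjecture) -> Conjecture:
    """Formally prove or disprove one conjecture (AlphaProof's role)."""
    raise NotImplementedError  # must go through a formal checker such as Lean

def solve(problem: str) -> list[Conjecture]:
    examples = generator(problem, n_examples=100)
    conjectures = observer(examples)
    return [prover(c) for c in conjectures]
```

Whether each box is an LLM or a traditional model is exactly the experimentation question; the key constraint, as said above, is that the results are formally verified.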
I think you can see that at work in the solution to problem 2, which is correct but feels very unintuitive; there are much more elegant solutions to this problem. (I specifically only looked at this one for now; I may update this in an edit for the other problems.)
I saw someone claiming the opposite:
IMO problems are specifically selected to be non-standard. For the previous 10 years, I served as the national coach of the USA International Math Olympiad team ( https://www.quantamagazine.org/po-shen-loh-led-the-u-s-math-team-back-to-first-place-20210216 ). During the IMO itself, the national coaches meet to select the problems that will appear on the exam paper. One of the most important tasks of that group is to avoid problems that have similarity to problems that have appeared anywhere before. During those meetings, national coaches would often dig up an old obscure math competition, on which a similar problem had appeared, and show it to the group, after which the proposed problem would be struck down.
So, this AI breakthrough is totally different from #GPT being able to do standardized tests through pattern-matching. It strikes at the heart of discovery. It’s very common for students to hit a wall the first time they try IMO-style problems, because they are accustomed to learning from example, remembering, and executing similar steps.
The methods used to solve these “new” problems are, as I’ve already stated, highly amenable to brute-force approaches. It’s more of a computation problem (see [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)). Again, I’m not surprised these kinds of problems got solved.
While these problems may be new, they rely on methods very similar to those used in previous competitive programming and math olympiad problems. I don’t think the author has really looked into the specifics of how AlphaGeometry and AlphaProof arrived at these solutions. It’s honestly disappointing that they were able to mislead people like him (if he truly hasn’t looked into the specifics); it seems more like he wants to use his status to push a particular narrative.
I would bet a lot that this system will fail on almost any combinatorics problem at the moment...
Since this is a point that seemingly causes a lot of confusion and misunderstanding, I’ll try to find some time to write down my reasoning and thoughts in a more exhaustive post.
I would bet a lot that this system will fail on almost any combinatorics problem at the moment...
I don’t know if you guessed this prior to reading the post, but if so, good guess:
AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct. This included the hardest problem in the competition, solved by only five contestants at this year’s IMO. AlphaGeometry 2 proved the geometry problem, while the two combinatorics problems remained unsolved.
I’ve been wondering about this ever since I saw that sentence, so now I’m curious to see your post explaining your reasoning.
ETA: I just saw sunwillrise’s great comment on this and am now wondering how your reasoning compares
They already claimed once to be at a 1200 Elo level in competitive programming on Codeforces, but in real competition settings it only reached, if I remember correctly, a level of around 500 once people found the account they had used for testing.
I’d be interested in reading more about this. Could you provide a link?