1. Why do we need to estimate the difficulty of open problems? There is a common sentiment in math that there is “no low-hanging fruit,” and there is some truth to this claim. Open problems sit at the frontier of mathematical knowledge these days, so unless you’re highly specialized, you’re probably not finding and solving them. If a model can solve any open problem, regardless of its “difficulty,” then it has proved its capability. In short, this requirement seems contrived.
Roughly, it’s because for a benchmark to be useful at identifying practical AI capabilities, it must be neither too easy nor too hard.
For example, if the statement to be proved is, say, independent of ZFC, then no computer that can be simulated by a Turing machine (which includes all LLMs) can resolve the conjecture. Worse, independence lets you manufacture conjectures that are arbitrarily hard to settle, and even conjectures that are not independent may in practice be unsolvable by any human or AI for a long time. Either way, the benchmark becomes less useful for evaluating real AIs.
The first point seems weak, as you could argue that doing math is a skill rather than a reflection of “genuine reasoning skills”.
I actually agree with this, though in a sense both can hold at once.
The second point strikes a pet peeve of mine: the overreliance on Lean in the broader math community. It seems to be the only proof assistant people know. I don’t particularly like it, nor do I like Coq, as these proof assistants feel a little alien compared to how I actually do math. Consider, for instance, the undergraduate math topics still missing in Lean.
While I am not experienced at all in formalizing math, and am thus willing to update and be corrected by any expert on mathematics, especially those who formalize mathematics in proof assistants, I’d expect two language-independent reasons why formalizing mathematics in proof assistants is difficult:
There just aren’t that many people who want to formalize mathematics by default, compared to doing something else.
It’s hard to formalize the details. In a math paper you can get away with being a little informal (though still pretty rigorous); in particular, you can elide details that you must spell out explicitly when formalizing a proof. Most mathematicians haven’t developed the habit of attending to those details closely enough for formalizing proofs to come naturally to them.
To be clear, maybe these reasons are wrong, and I’m willing to accept correction, but this would be my view on why formalizing undergraduate mathematics is hard.
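To make the second reason concrete, here is a tiny Lean 4 sketch (assuming Mathlib is available): even a step a paper would wave off as obvious must be explicitly discharged in a proof assistant, here by the `ring` tactic.

```lean
import Mathlib.Tactic.Ring

-- On paper, "(a + b)² = a² + 2ab + b²" would be asserted without comment;
-- formally, even this must be justified. Mathlib's `ring` tactic closes
-- the goal by normalizing both sides of the equation.
example (a b : ℤ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  ring
```

Tactics like `ring` automate the easy cases; the pain starts when the elided step has no off-the-shelf automation.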
I agree there should be something closer to a naturalistic benchmark like proving conjectures, though confounds are going to be a problem if you go that route.
For example, if the statement to be proved is, say, independent of ZFC, then no computer that can be simulated by a Turing machine (which includes all LLMs) can resolve the conjecture. Worse, independence lets you manufacture conjectures that are arbitrarily hard to settle, and even conjectures that are not independent may in practice be unsolvable by any human or AI for a long time. Either way, the benchmark becomes less useful for evaluating real AIs.
I don’t believe this is true, actually! What do you mean by “resolve the conjecture”? If you mean come up with a proof of it, then of course you can write a Turing machine that will eventually write a proof of the conjecture: it’s just infinite monkeys. ZFC is best thought of as a “minimal set of axioms for doing most math”; it’s not anything particularly special. You can use various foundations, such as ETCS, NF, type theory, etc. If we have a model that can genuinely reason mathematically, then the set of axioms the model uses should be immaterial to its mathematical ability. In fact, it should certainly be able to handle more or fewer axioms, like replacing full choice with countable choice. Maybe I misunderstood your point here.
While I am not experienced at all in formalizing math, and am thus willing to update and be corrected by any expert on mathematics, especially those who formalize mathematics in proof assistants, I’d expect two language-independent reasons why formalizing mathematics in proof assistants is difficult:
But my point was that there are things that should be extremely easy, like proving lemmas about elementary row transformations, that have not been done in Lean yet. That is not due to a lack of people formalizing; it is due to fundamental limitations of the proof assistant. The point I’m failing to make explicit is that this seems like a cop-out. The ultimate naturalistic benchmark for an LLM’s math ability is formalizing the undergraduate math curriculum! But that starts with having a proof assistant amenable to the formalization project, which seems to be the bottleneck today.
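For what it’s worth, here is a sketch of how one such lemma might be stated and proved in Lean 4 against Mathlib. The name `swapRow` is invented here for illustration (it is not an existing Mathlib definition, and Mathlib’s actual matrix API may organize row operations differently).

```lean
import Mathlib.Data.Matrix.Basic

variable {m n : ℕ} {α : Type*}

/-- Hypothetical helper: swap rows `i` and `j` of a matrix, defined via
the permutation `Equiv.swap` acting on row indices. -/
def swapRow (M : Matrix (Fin m) (Fin n) α) (i j : Fin m) :
    Matrix (Fin m) (Fin n) α :=
  fun r c => M (Equiv.swap i j r) c

/-- Swapping the same pair of rows twice gives back the original matrix. -/
theorem swapRow_swapRow (M : Matrix (Fin m) (Fin n) α) (i j : Fin m) :
    swapRow (swapRow M i j) i j = M := by
  funext r c
  simp [swapRow, Equiv.swap_apply_self]
```

Stating and proving this one lemma is short; the friction tends to show up in gluing dozens of such lemmas into a usable, reusable theory.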