For example, if the statement to be proved is, say, independent of ZFC (as the continuum hypothesis is), then no system that can be simulated by a Turing machine (which includes all LLMs) can resolve the conjecture. Since independent statements exist, you can construct conjectures that are arbitrarily hard to settle, and even the non-independent conjectures may in practice be unsolvable by any human or AI for a long time, which makes the benchmark less useful for evaluating real AIs.
I don’t believe this is true, actually! What do you mean by “resolve the conjecture”? If you mean come up with a proof of it, then of course you can write a Turing machine that will write a proof of the conjecture; it’s just infinite monkeys. ZFC is best thought of as the “minimal set of axioms to do most math”. It’s not anything particularly special. You can have various foundations such as ETCS, NF, type theory, etc. If we have a model that can genuinely reason mathematically, then the set of axioms the model uses should be immaterial to its mathematical ability. In fact, it should certainly be able to handle more or fewer axioms, like replacing full choice with countable choice, etc. Maybe I misunderstood your point here.
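As a small, hedged illustration of the axiom-flexibility point (a minimal sketch in Lean 4, no Mathlib required; the theorem names here are my own): Lean tracks exactly which axioms each proof depends on, so reasoning under stronger or weaker assumptions is explicit and auditable.

```lean
-- A constructive proof of a simple fact: it uses no axioms beyond Lean's kernel.
theorem nat_add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

#print axioms nat_add_comm'    -- reports that it depends on no axioms

-- A classical proof: Lean derives excluded middle from choice, so the
-- axiom checker reports the classical axioms the proof relies on.
theorem excluded_middle (p : Prop) : p ∨ ¬p :=
  Classical.em p

#print axioms excluded_middle  -- reports propext, Classical.choice, Quot.sound
```

Swapping full choice for countable choice, or dropping classicality altogether, is then just a matter of which axioms a proof is allowed to mention.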
While I am not experienced at all in formalizing math, and am thus willing to update and be corrected by any expert on mathematics, especially those who formalize mathematics in proof assistants, I’d expect two language-independent reasons why formalizing mathematics in proof assistants is difficult:
But my point was that there are things that should be extremely easy, like proving lemmas about elementary row transformations, that have not been done in Lean yet. That is not due to a lack of people formalizing, but due to fundamental limitations of the proof assistant. The point that I’m failing to make explicit is that this seems like a cop-out. The ultimate naturalistic benchmark for an LLM’s math ability is being able to formalize the undergraduate math curriculum! But that starts with having a proof assistant that is amenable to the formalization project, which seems to be the bottleneck today.
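For concreteness, here is a hedged sketch of what even a toy version of such a lemma looks like in Lean 4 (matrices as plain functions; `swapRows` and the lemma name are my own illustrative choices, not Mathlib API, and Mathlib may organize this material quite differently):

```lean
-- Represent an m × n matrix over α as a function of row and column indices.
def swapRows {m n : Nat} {α : Type} (A : Fin m → Fin n → α) (i j : Fin m) :
    Fin m → Fin n → α :=
  fun k => if k = i then A j else if k = j then A i else A k

-- One of the "should be extremely easy" facts: swapping a row with itself
-- is the identity transformation.
theorem swapRows_self {m n : Nat} {α : Type}
    (A : Fin m → Fin n → α) (i : Fin m) :
    swapRows A i i = A := by
  funext k l
  by_cases h : k = i <;> simp [swapRows, h]
```

Even this toy statement forces representation choices (functions vs. a dedicated matrix type, which index types, how to phrase the operation) that a library has to commit to up front, which is part of why “easy” paper lemmas can be slow to land.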
My point was to gesture at the fact that open conjectures can be so hard that neither humans nor AIs can solve them in a reasonable amount of time, and that it’s too hard to determine in advance whether a given conjecture is even provable by an AI or a human in a reasonable amount of time. That’s why “prove open conjectures” is not a great benchmark, and I was using independence from ZFC as an example of how hard such conjectures can be.
Also, a Turing machine in that case (where a statement is independent of ZFC but not recursively independent) will at best produce a proof of independence from ZFC, not a proof or disproof of the conjecture itself.

More here:
https://x.com/ElliotGlazer/status/1871121633018331499
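To spell out the standard argument behind this (my own summary): let $M$ be the brute-force prover that enumerates all finite strings and halts when one of them is a valid ZFC-derivation of $\varphi$ or of $\neg\varphi$. Then

$$M(\varphi) \text{ halts} \iff \mathrm{ZFC} \vdash \varphi \;\text{ or }\; \mathrm{ZFC} \vdash \neg\varphi,$$

so on a statement independent of ZFC, $M$ runs forever. And a machine searching instead for a proof of the independence itself has to work in a stronger metatheory: “$\mathrm{ZFC} \nvdash \varphi$” already implies $\mathrm{Con}(\mathrm{ZFC})$, which ZFC (if consistent) cannot prove, by Gödel’s second incompleteness theorem.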
I definitely think that formalizing an entire undergraduate math curriculum would be a great benchmark, but it would have to be designed very carefully to ensure the LLM can’t undermine the benchmark’s usefulness (e.g., by gaming it).
Also, when you say that it’s extremely easy to formalize certain lemmas/theorems, do you mean that it’s easy to formalize them outright, with all necessary details included, or that it’s easy to write a paper proof that would be accepted by mathematicians, which you could then (though the journey may be painful) formalize with all necessary details?
Bonus question: What do you think are the current limitations of proof assistants such as Lean, more specifically?