An idea that feels promising to me for proving things using LLMs (though I don't know whether it actually is) is something along the following lines:
1. A completely rigorous automated proof-checking system along the lines of Lean.
2. LLM 1: a language-model-y translator from informal mathematical statements, plus purely-formalized context, into formalized equivalents of those statements.
3. LLM 2: a language-model-y system generating highly unreliable proposals for the next thing to try in a proof.
4. (If necessary given the foregoing:) a traditional good-old-fashioned-AI-type search algorithm attempting to build proofs, using LLM 2 to generate candidate steps, LLM 1 to translate them into formalism, and Lean-or-whatever to check whether each candidate step is easy to prove from what's been proved already. (There's a sketch of this loop just after the list.)
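To make the shape of component 4 concrete, here is a minimal Python sketch of the search loop. Everything in it is a hypothetical stand-in: `propose_steps`, `formalize`, `lean_accepts`, and `is_complete` are placeholder names for LLM 2, LLM 1, and a wrapper around the checker, not real APIs.

```python
# Hypothetical sketch of the duct-taped search loop described above.
# None of the stubbed functions corresponds to a real library.

from dataclasses import dataclass, field


@dataclass
class ProofState:
    goal: str                                        # the formal statement to prove
    steps: list[str] = field(default_factory=list)   # formal steps accepted so far


def propose_steps(state: ProofState, n: int = 5) -> list[str]:
    """LLM 2 (hypothetical): n highly-unreliable informal candidate next steps."""
    raise NotImplementedError


def formalize(informal_step: str, state: ProofState) -> str | None:
    """LLM 1 (hypothetical): translate an informal step, given the formal
    context so far, into formal syntax; None if translation fails."""
    raise NotImplementedError


def lean_accepts(state: ProofState, formal_step: str) -> bool:
    """Hypothetical checker wrapper: is the candidate step easy to prove
    from what's been proved already?"""
    raise NotImplementedError


def is_complete(state: ProofState) -> bool:
    """Hypothetical: has the goal been fully discharged?"""
    raise NotImplementedError


def search(initial: ProofState, max_depth: int = 50) -> ProofState | None:
    """Plain depth-first search over checker-approved steps."""
    frontier = [initial]
    while frontier:
        state = frontier.pop()
        if is_complete(state):
            return state
        if len(state.steps) >= max_depth:
            continue  # give up on branches that run too long
        for informal in propose_steps(state):
            formal = formalize(informal, state)
            if formal is not None and lean_accepts(state, formal):
                frontier.append(ProofState(state.goal, state.steps + [formal]))
    return None
```

Depth-first is just the simplest thing to write down; a beam or best-first search keyed on the proposer's confidence would presumably slot in the same way, and the division of labour between the three components is unchanged.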
This is not so very different from what human mathematicians do, and it doesn’t seem super-improbable that a modest amount of improvement to the models might be sufficient to make it work pretty well.
[EDITED to add:] One thing I don't like about this sort of separate-models-and-duct-tape approach is the lack of integration: the language model doesn't really know what the proving-bits are thinking. I don't know how much difference this actually makes, but it feels a bit iffy.