I think what I mean by “promising at math” is that I would assign a reasonably confident belief that a scaled-up model with the same basic architecture as this one, maybe with a more mathematical training set, would be able to accomplish tasks that a mathematician might find helpful to a nonzero degree. For example, I could imagine a variant that is good at ‘concept-linking,’ better than Google Scholar at least, where you ask it something like “What are some ideas similar to valuation rings that people have explored?” or “Where do Gorenstein rings come up in combinatorics?” and get a list of some interesting leads, some nonsense, and some lies and fabricated sources. If you try this with ChatGPT, you get mostly either (1) very well-known things that you could find by searching MathOverflow or StackExchange, or (2) made-up sources that sound really cool and believable, until you figure out they aren’t real. I would imagine that prompt-engineering could weed out type (1), and that failure mode (2) is something we’ll see gradual progress on. There seems to be a lot of interest in fixing the “making stuff up” issue, and I tend to believe that if a lot of smart people gravitate towards a problem, it will probably get at least somewhat better over time.
As to “bad at arithmetic,” I agree, it definitely goes quite a bit further than not being able to reliably compute 416*31 or something. I was able to get it to solve some very simple Sylow-theorem-type problems with a lot of hand-holding and pointing out errors in its answers, but the failure you describe definitely sounds about right to me. It can say something like [P and Q and R] therefore [S]. Then you convince it [P] is false, and it will say [not P and Q and R] therefore [S], as though correcting [P] to [not P] fixes the argument rather than breaking the implication. You may prompt it again, and it says “I apologize for the mistake! Here is a corrected version: [Q and R] therefore [S]” when [Q] and [R] are not nearly enough. I do not expect these things to be good theorem-provers or logicians without significant changes. I’m not even sure how you would coax a natural-language model into doing formal reasoning competently. At one end of the spectrum, you could hamstring it so that it only produces strings that are well-formed in a formal system, only using valid rules of inference, but then it’s going to be so limited in the leaps it’s able to make that it will never be able to discuss high-level concepts with you. I don’t know how you get a stochastic parrot to map the word salad it’s generating onto something that resembles valid formal reasoning.
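To make that failure mode concrete, here is a toy rendering of it in Lean 4. This is my own illustration, not anything the model produced: the original argument really does go through, but after “correcting” [P] to [not P] the same implication simply no longer applies, and no amount of apologizing fixes that.

```lean
-- The original (valid) argument: from P, Q, R and the implication
-- P ∧ Q ∧ R → S, we really do get S.
example (P Q R S : Prop) (h : P ∧ Q ∧ R → S)
    (hp : P) (hq : Q) (hr : R) : S :=
  h ⟨hp, hq, hr⟩

-- The "corrected" argument: same implication, but now we only have
-- ¬P, Q and R.  There is no way to feed h the proof of P it needs,
-- so the following is NOT provable (uncomment to watch Lean reject it):
--
-- example (P Q R S : Prop) (h : P ∧ Q ∧ R → S)
--     (hnp : ¬P) (hq : Q) (hr : R) : S :=
--   h ⟨_, hq, hr⟩   -- stuck: nothing of type P is available
```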
An idea that feels promising to me for proving things using LLMs, though I don’t know whether it actually is, goes something like the following (a rough code sketch follows the list):
1. A completely rigorous automated proof-checking system along the lines of Lean.
2. A language-model-y translator (call it LLM 1) from informal mathematical statements, plus purely-formalized context, into formalized equivalents of those statements.
3. A language-model-y system (LLM 2) generating highly unreliable proposals for the next thing to try in a proof.
4. (If necessary given the foregoing:) A traditional good-old-fashioned-AI-type search algorithm attempting to build proofs, using LLM 2 to generate candidate steps, LLM 1 to translate them into formalism, and Lean-or-whatever to check whether each candidate step is easy to prove from what has been proved already.
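Here is a very rough sketch of how the glue in step 4 might look, purely as an illustration of the shape of the loop. The names `propose_steps` (LLM 2), `formalize` (LLM 1), and `checker.verifies` (the Lean-like checker) are hypothetical interfaces I am making up for the sketch, not real APIs.

```python
def try_to_prove(goal, context, propose_steps, formalize, checker,
                 max_depth=10, beam_width=5):
    """Search for a proof of `goal`: ask LLM 2 for candidate next steps,
    have LLM 1 formalize each one, keep only the steps the rigorous
    checker accepts, and repeat until the goal itself follows."""
    proved = list(context)              # formal statements established so far
    for _ in range(max_depth):
        if checker.verifies(goal, assuming=proved):
            return proved               # the goal follows from what we have
        # LLM 2: highly unreliable informal suggestions for the next step.
        candidates = propose_steps(goal, proved)[:beam_width]
        accepted = []
        for informal_step in candidates:
            # LLM 1: translate the informal suggestion into the formal system.
            formal_step = formalize(informal_step, context=proved)
            # Only keep steps the proof checker can actually verify.
            if formal_step is not None and checker.verifies(formal_step, assuming=proved):
                accepted.append(formal_step)
        if not accepted:
            return None                 # dead end: no candidate survived checking
        proved.extend(accepted)
    return None                         # gave up within the depth budget
```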
This is not so very different from what human mathematicians do, and it doesn’t seem super-improbable that a modest amount of improvement to the models might be sufficient to make it work pretty well.
[EDITED to add:] One thing I don’t like about this sort of separate-models-and-duct-tape approach is the lack of integration: the language model doesn’t really know what the proving-bits are thinking. I don’t know how much difference this actually makes, but it feels a bit iffy.