I feel that “promising at math” is a bit too optimistic for ChatGPT in its current form, and “bad at arithmetic” understates its deficits.
(I should also acknowledge that it’s an astonishing achievement to have created something good enough that that sentence, as opposed to something like “of course all it produces is meaningless word salad”, is one worth saying at all.)
“Bad at arithmetic” understates it because ChatGPT’s failures of mathematical reasoning aren’t just about being rubbish at arithmetic. It’s bad at logic. For instance, I asked it to give examples of two non-isomorphic groups of order 5 (which of course it can’t; there is only one isomorphism class of groups of order 5). It said something along the lines of “Yes, I can do that. One of the groups is the cyclic group of order 5, which [… blah description blah …]. The other is the dihedral group D_5, which [… blah description blah, including mentioning that it has order 10 …]. These groups are not isomorphic because they are of different order.” I pointed out that these groups are blatantly not both of order 5, and it said oops and provided another example of exactly the same type: two groups, one of which is C_5 and the other of a different order. After repeating this process maybe four times, with different bogus examples each time, I pointed out the pattern and asked whether it was sure that an example exists, and it just repeatedly doubled down. This isn’t about being bad at arithmetic; it’s about piecing together fragments of reasoning without any sort of higher-level coherence between them. I am more sympathetic to the “stochastic parrot” view of LLMs after trying this experiment.
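For what it’s worth, the specific order mismatch is the kind of thing that is trivial to check mechanically. A small sanity check using SymPy’s permutation groups (my own illustration, not part of the original exchange):

```python
from sympy.combinatorics.named_groups import CyclicGroup, DihedralGroup

# SymPy's DihedralGroup(5) is D_5 acting on 5 points: it has 2*5 = 10
# elements, so it cannot serve as a second group of order 5.
C5 = CyclicGroup(5)
D5 = DihedralGroup(5)

print(C5.order())  # 5
print(D5.order())  # 10
```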
As for “promising at math”: maaaaaybe, at least in the sense of “produces high-level word salad of a sort that might occasionally happen to contain interesting mathematical concepts”. But I’d want to see more examples, and preferably better ones than the one here (I dunno, maybe n-coherence is more interesting than it sounds? It sounds pretty unlikely to be worth studying, to me). I am aware that “can produce genuinely interesting new mathematical ideas at a non-negligible rate” is a really high bar to clear, of course...
I think what I mean by “promising at math” is that I would assign reasonably high confidence to the claim that a scaled-up model with the same basic architecture as this one, maybe with a more mathematical training set, would be able to accomplish tasks that a mathematician might find helpful to a nonzero degree. For example, I could imagine a variant that is good at ‘concept-linking,’ better than Google Scholar at least, where you ask it something like “What are some ideas similar to valuation rings that people have explored?” or “Where do Gorenstein rings come up in combinatorics?” and get a list of some interesting leads, some nonsense, and some lies and fabricated sources. If you try this with ChatGPT, you get mostly either (1) very well-known things that you could find by searching MathOverflow or StackExchange, or (2) made-up sources that sound really cool and believable, until you figure out they aren’t real. I would imagine that prompt engineering could weed out type (1), and failure mode (2) is something we’ll see gradual progress on. There seems to be a lot of interest in fixing the “making stuff up” issue, and I tend to believe that if a lot of smart people gravitate towards an issue, it will probably get at least somewhat better over time.
As to “bad at arithmetic,” I agree, it definitely goes quite a bit further than not being able to reliably compute 416*31 or something. I was able to get it to solve some very simple Sylow-theorem-type problems with a lot of hand-holding and pointing out errors in its answers, but the failure you describe definitely sounds about right to me. It can say something like [P and Q and R] therefore [S]. Then you convince it [P] is false, and it will say [not P and Q and R] therefore [S], as though correcting [P] to [not P] fixes the argument (rather than breaking the implication). If you prompt it again, it says “I apologize for the mistake! Here is a corrected version: [Q and R] therefore [S]” when [Q] and [R] are not nearly enough. I do not expect these things to be good theorem-provers or logicians without significant changes. I’m not even sure how you would coax a natural-language model into doing formal reasoning competently. At one end of the spectrum, you could hamstring it to only produce strings that are well-formed in a formal system, using only valid rules of inference, but then it would be so limited in the leaps it can make that it would never be able to discuss high-level concepts with you. I don’t know how you get a stochastic parrot to map the word salad it’s generating onto something that resembles valid formal reasoning.
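To make that failure pattern concrete, here is a toy semantic check (my own illustration, not something from the exchange) in which the conclusion S genuinely depends on premise P. The original argument is valid, but neither the “not P” repair nor simply dropping P preserves the entailment:

```python
from itertools import product

def entails(premises, conclusion):
    """True iff every assignment of P, Q, R satisfying all premises
    also satisfies the conclusion."""
    for p, q, r in product([True, False], repeat=3):
        env = {"P": p, "Q": q, "R": r}
        if all(prem(env) for prem in premises) and not conclusion(env):
            return False
    return True

# The conclusion S is chosen to depend on P (here S := P and Q).
S = lambda env: env["P"] and env["Q"]
P = lambda env: env["P"]
notP = lambda env: not env["P"]
Q = lambda env: env["Q"]
R = lambda env: env["R"]

print(entails([P, Q, R], S))     # True: [P and Q and R] therefore [S] is valid
print(entails([notP, Q, R], S))  # False: swapping [P] for [not P] breaks the argument
print(entails([Q, R], S))        # False: [Q] and [R] alone are not enough
```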
An idea that feels promising to me for proving things using LLMs (though I don’t know whether it actually is) runs along the following lines:
1. A completely rigorous automated proof-checking system along the lines of Lean.
2. A language-model-y translator (call it LLM 1) from informal mathematical statements, plus purely formalized context, into formalized equivalents of those statements.
3. A second language-model-y system (LLM 2) generating highly unreliable proposals for the next thing to try in a proof.
4. (If necessary given the foregoing:) a traditional good-old-fashioned-AI-type search algorithm that attempts to build proofs, using LLM 2 to generate candidate steps, LLM 1 to translate them into formalism, and Lean-or-whatever to check whether each candidate step is easy to prove from what has been proved already. (A rough sketch of this loop follows the list.)
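Here is a rough sketch, in Python-flavoured pseudocode that nonetheless runs, of how component 4 might glue the other pieces together. The callables propose_step, formalize, and lean_check are hypothetical stand-ins for LLM 2, LLM 1, and the Lean-style checker; none of them are real APIs.

```python
def attempt_proof(goal, context, propose_step, formalize, lean_check,
                  max_steps=50):
    """Greedy search for a proof of `goal` from `context`.

    Each round: ask the proposal model (LLM 2) for candidate next steps,
    have the translator (LLM 1) formalize each one, and keep the first
    candidate that the checker accepts as following from what has been
    proved so far.  All three callables are hypothetical stand-ins.
    """
    proved = list(context)
    for _ in range(max_steps):
        if lean_check(proved, goal):
            return proved                      # the goal already follows
        for informal_step in propose_step(goal, proved):
            formal_step = formalize(informal_step, proved)
            if formal_step is not None and lean_check(proved, formal_step):
                proved.append(formal_step)     # accept the checked step
                break
        else:
            return None                        # no candidate checked out; give up
    return None
```

A real version would need backtracking rather than this greedy loop, but the division of labour is the point: the language models only ever propose, and nothing enters `proved` unless the checker verifies it.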
This is not so very different from what human mathematicians do, and it doesn’t seem super-improbable that a modest amount of improvement to the models might be sufficient to make it work pretty well.
[EDITED to add:] One thing I don’t like about this sort of separate-models-and-duct-tape approach is the lack of integration: the language model doesn’t really know what the proving-bits are thinking. I don’t know how much difference this actually makes, but it feels a bit iffy.