Two questions about capabilities of GPT-4.
The jump in capabilities from GPT-3 to GPT-4 seems much less impressive than the jump from GPT-2 to GPT-3. Part of that is likely because later versions of GPT-3 were noticeably smarter than the first ones, but that reason doesn’t seem sufficient to me. So what’s up? Should I expect that the jump from GPT-4 to GPT-5 will be barely noticeable?
In particular, I am rather surprised at the apparent lack of ability to solve nonstandard math problems. I didn’t expect it to beat the IMO, but I did expect that problems for 12-year-olds would be accessible, and they weren’t. (I personally tried only Bing, so perhaps regular GPT-4 is better. But I’ve seen only one successful attempt with GPT-4, and it was mostly trigonometry.) So what’s up? I am tempted to say that math is just harder than economics, biology, etc. But that’s likely not it.
GPT-4 scored 700/800 on the SAT math test. I don’t think a 12-year-old gets such a score.
I didn’t say “it’s worse than a 12-year-old at any math task”. I meant nonstandard problems. Perhaps that’s the wrong English terminology? Something like easy olympiad problems?
The actual test that I performed was “take several easy problems from a math circle for 12-year-olds and try various ‘let’s think step-by-step’ prompts to make Bing write solutions”.
Example of such a problem:
Between 20 poles, several ropes are stretched (each rope connects two different poles; there is no more than one rope between any two poles). It is known that at least 15 ropes are attached to each pole. The poles are divided into groups so that each rope connects poles from different groups. Prove that there are at least four groups.
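(For reference, the intended solution is a short counting argument; here is a sketch, assuming I’ve stated the problem correctly: suppose there were at most three groups. By pigeonhole, some group contains at least ⌈20/3⌉ = 7 poles. A pole in that group can only be connected to poles outside its own group, i.e. to at most 20 - 7 = 13 other poles, contradicting the requirement that at least 15 ropes are attached to every pole. Hence there must be at least four groups. This two-step count is the kind of reasoning I was hoping a step-by-step prompt would elicit.)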
Most 12-year-olds are not going to be able to solve that problem.
Yeah, you are right. It seems that it was actually one of the harder ones I tried. This particular problem was solved by 4 of the 28 members of a relatively strong group. I distinctly remember also trying some easy problems from a relatively weak group, but I don’t have notes and Bing doesn’t save chats.
I guess I should just try again, especially in light of gwillen’s comment. (By the way, if somebody with access to the actual GPT-4 is willing to help me with testing it on some math problems, I’d really appreciate it.)
It’s extremely important in discussions like this to be sure of what model you’re talking to. Last I heard, Bing in the default “balanced” mode had been switched to GPT-3.5, presumably as a cost-saving measure.
That would explain a lot. I’ve heard this rumor, but when I tried to trace the source, I couldn’t find anything better than guesses. So I dismissed it, but maybe I shouldn’t have. Do you have a better source?
For 30% of tasks, users actually prefer GPT-3 over GPT-4. For many tasks, the output will barely vary. Yet there are some where the output changed drastically, and for the better. If you aren’t noticing it, those were not your areas of focus. A lot of it concerns things like psychological bias and deception, tricks that children fall for and adults spot, as well as spatial and visual reasoning.
LLMs are terrible at math. Not because it is harder, but because the principles are different, and machine learning is a shitty way to learn them. The very thing that makes them good at poetry makes them suck at math. They can’t even count the words in a text accurately. This will likely not improve much from improving LLMs themselves; the solution is external plug-ins, e.g. into Wolfram Alpha, which are already being done.
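To make the plug-in point concrete, here is a minimal sketch of the general tool-call idea, with SymPy standing in as the external calculator (this is not the actual Wolfram Alpha plugin API, just an illustration of offloading exact computation from the model):

```python
# Illustrative sketch only: the host routes math to an exact external tool
# instead of letting the LLM guess at arithmetic or word counts.

import sympy

def answer_math(expression: str) -> str:
    """Evaluate an arithmetic/algebraic expression exactly with SymPy."""
    result = sympy.sympify(expression)   # parse, e.g. "700/800"
    return str(sympy.simplify(result))   # exact result, no sampling involved

def count_words(text: str) -> int:
    """The word-counting task LLMs get wrong; trivial as a tool call."""
    return len(text.split())

if __name__ == "__main__":
    print(answer_math("700/800"))   # -> 7/8, exact
    print(count_words("Between 20 poles, several ropes are stretched"))  # -> 7
```

The model only has to decide when to call the tool and with what arguments; the exact computation never goes through next-token prediction.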
My girlfriend had moderate success getting it to work on theoretical physics concepts, after extensive prompting to be more technical and guiding it through the steps. If you like math, that might be more interesting for you.
I agree that there are some impressive improvements from GPT-3 to GPT-4. But they seem to me a lot less impressive than the jump from GPT-2 producing barely coherent texts to GPT-3 (somewhat) figuring out how to play chess.
I disagree with your take on LLMs’ math abilities. Wolfram Alpha helps with tasks like the SAT, and GPT-4 is doing well enough on those. But for some reason it (at least in the incarnation of Bing) has trouble with simple logic puzzles like the one I mentioned in another comment.
Can you tell me more about the success with theoretical physics concepts? I don’t think I’ve seen anybody try that.
Not coherently, no. My girlfriend is a professor of theoretical physics and theory of machine learning, and my understanding of her work is extremely fuzzy. But she was stuck on something where I was being a rubber ducky, which is tricky insofar as I barely understand what she does, and I proposed talking to ChatGPT. She basically entered the problem she was stuck on (her suspicion that two different things were related somehow, though she couldn’t quite pinpoint how).

It took some tweaking. At first, it was extremely superficial, giving an explanation more suited to Wikipedia or school homework than to the actual science, and she needed to push it over and over to finally get equations rather than just superficial, unconnected explanations. And at the time, the internet plugin was not out, so the lack of access to recent papers was a problem. But she said that eventually it spat out some accurate equations (though also the occasional total nonsense), made a bunch of connections between concepts that were accurate (though it could not always correctly identify why), and made some proposals for connections that she at least found promising. She was very intrigued by its ability to spot those connections; in some ways, it seemed to replicate the intuition an advanced physicist eventually obtains. She compared the experience to talking to an A-star Bachelor student who has memorised all the concepts and is very well read, but, if you start prodding, often has not truly understood them, and yet suddenly makes some connections that should be vastly beyond them, though unable to properly explain why. She still found it helpful and interesting.

I am still under the impression it does much worse in this area than in, e.g., biology or computer science.
As for the logic puzzle, the GPT-4 technical report also seems confused by this. They had some logic puzzles which the model failed at, and where performance got worse and worse with each iteration, only for it to suddenly learn them with no warning. I haven’t spotted the pattern yet, but can only say that it reminds me strongly of the mistakes you see in young children who lack advanced theory of mind and time perception. E.g. it has huge difficulty with the idea that it needs to judge a past situation without being biased by the knowledge it has now, or that it needs to have knowledge but withhold it from another person to win a game. As humans, we tend to forget that these are very advanced skills, because we are so good at them.