The example math proof in this paper is, as far as I can tell, wrong (despite being called a “correct proof” in the text). It doesn’t have a figure number, but it’s on page 40.
I could be mistaken, but the statement Then g(y*) < (y*)^2 , since otherwise we would have g(x) + g(y*) > 2xy, seems like complete nonsense to me. (The last use of y should probably be y* too, but if you’re being generous, you could call that a typo.) If you negate g(y*) < (y*)^2, you get g(y*) >= (y*)^2, and then g(x) + g(y*) >= g(x) + (y*)^2, but then what?
Messing up that step is a big deal too since it’s the trickiest part of the proof. If the proof writer were human, I’d wonder whether they have some line of reasoning in their head that I’m not following that makes that line make sense, but it seems overly generous to apply that possibility to an auto-regressive model (where there is no reasoning aside from what you see in the output).
Interestingly, it’s not unusual for incorrect LLM-written proofs to be wrongly marked as correct. One of Minerva’s example proofs (shown in the box “Breaking Down Math” in the Quanta article) says “the square of a real number is positive”, which is false in general—the square of a real number is non-negative, but it can be zero too.
I’m not surprised that incorrect proofs are getting marked as correct, because it’s hard manual work to carefully grade proofs. Still, it makes me highly skeptical of LLM ability at writing natural language proofs. (Formal proofs, which are automatically checked, are different.)
As a constructive suggestion for how to improve the situation, I’d suggest that, in benchmarks, the questions should ask “prove or provide a counterexample”, and each question should come in (at least) two variants: one where it’s true, and one where an assumption has been slightly tweaked so that it’s false. (This is a trick I use when studying mathematics myself: to learn a theorem statement, try finding a counter-example that illustrates why each assumption is necessary.)
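As a hedged illustration of that two-variant format (the theorem here is my own example, not one from the paper or any existing benchmark), a question pair might look like:

```latex
\textbf{Variant A (true).} Prove or give a counterexample:
every monotone, bounded sequence of real numbers converges.

\textbf{Variant B (assumption tweaked, now false).} Prove or give a counterexample:
every monotone sequence of real numbers converges.
% Counterexample: $a_n = n$ is monotone but unbounded, hence divergent.
```

Dropping the boundedness assumption flips the answer, so a grader can check whether the model actually tracks which hypotheses do the work rather than pattern-matching to a memorized proof.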
It’s a terribly organized and presented proof, but I think it’s basically right (although it’s skipping over some algebraic details, which is common in proofs). To spell it out:
Fix any x and y. We then have,
x^2 − 2xy + y^2 = (x − y)^2 ≥ 0.
Adding 2xy to both sides,
x^2 + y^2 ≥ 2xy.
Therefore, if (by assumption in that line of the proof) g(x) > x^2 and g(y) ≥ y^2, we’d have
g(x) + g(y) > x^2 + y^2 ≥ 2xy,
which contradicts our assumption that g(x) + g(y) ≤ 2xy.
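The unconditional part of that chain, x^2 + y^2 ≥ 2xy, can be spot-checked numerically. This is just a sanity sketch over random samples, not a proof (the proof is the (x − y)^2 ≥ 0 argument above):

```python
import random

# Spot-check the key inequality x^2 + y^2 >= 2xy,
# which holds because x^2 + y^2 - 2xy = (x - y)^2 >= 0.
random.seed(0)
for _ in range(10_000):
    x = random.uniform(-1e3, 1e3)
    y = random.uniform(-1e3, 1e3)
    # small tolerance for floating-point rounding
    assert x * x + y * y >= 2 * x * y - 1e-6

print("x^2 + y^2 >= 2xy held for all sampled pairs")
```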
Thanks. When it’s written as g(x) + g(y) > x^2 + y^2 ≥ 2xy, I can see what’s going on. (That one intermediate step makes all the difference!)
I was wrong then to call the proof “incorrect”. I think it’s fair to call it “incomplete”, though. After all, it could have just said “the whole proof is an exercise for the reader”, which is in some sense correct I guess, but not very helpful (and doesn’t tell you much about the model’s ability), and this is a bit like that on a smaller scale.
(Although, reading again, “...which contradicts the existence of y* given x” is a quite strange thing to say as well. I’m not sure I can exactly say it’s wrong, though. Really, that whole section makes my head hurt.)
If a human wrote this, I would be wondering if they actually understand the reasoning or are just skipping over a step they don’t know how to do. The reason I say that is that g(x) + g(y*) > 2xy* is the obvious contradiction to look for, so the section reads a bit like “I’d really like g(y*) < (y*)^2 to be true, and surely there’s a contradiction somehow if it isn’t, but I don’t really know why; this is probably the contradiction I’d get if I figured it out”. The typo-esque use of y instead of y* bolsters this impression.
It is correct, but I agree the reasoning is not written out properly. I’ll write y for y*.
We are in the case g(x) > x^2. So if g(y) ≥ y^2, then g(x) + g(y) > x^2 + y^2 ≥ 2xy, where the last inequality holds because it’s equivalent to (x − y)^2 ≥ 0. So we have g(x) + g(y) > 2xy, which is a contradiction to the choice of y.
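Since formal, machine-checked proofs came up earlier in the thread: the contradiction step here is small enough to formalize. An untested sketch in Lean 4 with Mathlib (assuming `nlinarith` closes the goal given the hint, which is how this kind of inequality is usually handled):

```lean
import Mathlib

-- If g x > x^2 and g y ≥ y^2, then g x + g y > 2xy,
-- using x^2 + y^2 ≥ 2xy, i.e. (x - y)^2 ≥ 0.
example (g : ℝ → ℝ) (x y : ℝ) (hx : g x > x ^ 2) (hy : g y ≥ y ^ 2) :
    g x + g y > 2 * x * y := by
  nlinarith [sq_nonneg (x - y)]
```

A checker like Lean would have caught the missing intermediate step automatically, which is exactly the difference between formal and natural-language proofs noted above.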
The second-to-last line of the proof is also very confusing. I think GPT-4 is using the same variable names to mean different variables, which leads it to say silly things like “the uniqueness of $y$ given $y$”.