Just a few quick comments about my “integer whose square is between 15 and 30” question (search for my name in Zvi’s post to find his discussion):
The phrasing of the question I now prefer is “What is the least integer whose square is between 15 and 30”, because that makes it unambiguous that the answer is −5 rather than 4: −5 is less, but 4 is smaller in magnitude, so a word like “smallest” would leave the question ambiguous. (This is a normal use of the word “least”, e.g. in competition math, that the model is familiar with.)
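For anyone who wants to sanity-check the intended answer, a one-line brute-force search over a small range confirms it (assuming “between” is exclusive; the answer is the same with inclusive bounds here):

```python
# Integers n with 15 < n^2 < 30 are {-5, -4, 4, 5}; the least is -5.
candidates = [n for n in range(-10, 11) if 15 < n * n < 30]
print(min(candidates))  # -5
```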
This Gemini model answers −5 to both phrasings. As far as I know, no previous model said −5 under either phrasing, although someone reported that o1 Pro gets −5. (I don’t have a subscription to o1 Pro, so I can’t independently check.)
I’m fairly confident that a majority of elite math competitors (top 500 in the US, say) would get this question right in a math competition (although maybe not in a casual setting where they aren’t on their toes).
But also this is a silly, low-quality question that wouldn’t appear in a math competition.
Does a model getting this question right say anything interesting about the model? I think a little. There’s a certain skill of being careful not to make unwarranted assumptions (e.g. that the integer is positive). Math competitors get better at this skill over time; it’s not that straightforward to learn.
I’m a little confused about why Zvi says the model gets it right in the screenshot, given that its final answer is 4. It seems like the model snatched defeat from the jaws of victory: if you cut off the very last sentence, I would call its answer correct.
Here’s the output I get:
I got this one wrong too. Ignoring negative roots is pretty common for non-mathematicians.
I’m half convinced that most of the lesswrong commenters wouldn’t pass as AGI if uploaded.
Yup, I think that only about 10-15% of LWers would get this question right.