I found this failure interesting, unexpected (to me), and honestly frustrating to watch Claude get wrong over and over again. It deserves to be seen by people smarter and more important than me.
I found your writing style off-putting and confusing, which seems counterproductive given how much work you appear to have put into this benchmark.
I sincerely recommend using Claude to rewrite this post and presenting the actual results of the benchmark in the form of a longer post or research paper.
It’s not worth much but I’ll commit to strong upvoting it and posting it on my twitter if you do so.
Off-putting: Why four em dashes in your title? Why do the tone, word choice, and style switch between fancy and plain so often? Why the typos? Claiming something is 50 times lower than commonly believed, redefining "times", and then barely supporting that redefinition seems fishy. And you don't actually give the results in an understandable format (in this post, that is; in the benchmark itself you seem to have done a really good job backing this up).
Confusing: What is the numbered list of ways you could have come up with these questions? It seems like you are describing increasingly malfeasant ways of doing so, but I can't tell. Why not show some example responses from the LLMs and/or explain their error modes? Tell us how you made these questions. What was your method for arriving at the formula you are using? Etc.
Claude would genuinely fix most of these problems, so run the post past him! He may not be as good at reasoning as I thought, but he is really good at writing.
The failures often seem to involve the model getting stuck reasoning about your problem in a way that pattern-matches too strongly to similar problems, and that is why it fails. Did you notice this as well?