Hastings comments on People aren’t properly calibrated on FrontierMath

Hastings 26 Dec 2024 15:03 UTC
5 points
−8
There’s an easy way to turn any mathematical answer-based benchmark into a proof-based benchmark, and it doesn’t require Coq, Lean, or any human formalization of the benchmark: just let the model choose whether or not to submit an answer for each question, and score the model zero for the whole benchmark if it submits any wrong answer.
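To make the scoring rule concrete, here is a minimal sketch in Python. The function name `all_or_nothing_score`, the use of `None` to mean "declined to answer", and the choice to count correct submissions as the score when nothing is wrong are all my assumptions for illustration; the only part taken from the rule itself is that a single wrong submission zeroes the whole run.

```python
from typing import Optional, Sequence


def all_or_nothing_score(
    submitted: Sequence[Optional[str]],
    correct: Sequence[str],
) -> int:
    """Hypothetical scorer: count of answered questions, or 0 if any answer is wrong.

    `submitted[i]` is None when the model abstains on question i (an assumed
    convention, not part of any real benchmark harness).
    """
    if len(submitted) != len(correct):
        raise ValueError("submitted and correct must have the same length")

    score = 0
    for answer, truth in zip(submitted, correct):
        if answer is None:
            continue  # abstaining is free: no credit, no penalty
        if answer != truth:
            return 0  # one wrong submission zeroes the entire benchmark
        score += 1
    return score


# Two correct answers and one abstention.
print(all_or_nothing_score(["4", None, "7"], ["4", "9", "7"]))
# Same run, but the model guesses on the second question and misses.
print(all_or_nothing_score(["4", "8", "7"], ["4", "9", "7"]))
```

Under this rule a guess that is right with probability p risks every point earned so far, so the model should only submit when it is close to certain, which is roughly the bar a proof would have to clear.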