There’s an easy way to turn any mathematical answer-based benchmark into a proof-based benchmark and it doesn’t require coq or lean or any human formalization of the benchmark design: just let the model choose whether or not to submit an answer for each question, and score the model zero for the whole benchmark if it submits any wrong answers.
There’s an easy way to turn any mathematical answer-based benchmark into a proof-based benchmark and it doesn’t require coq or lean or any human formalization of the benchmark design: just let the model choose whether or not to submit an answer for each question, and score the model zero for the whole benchmark if it submits any wrong answers.