and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning
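For concreteness, a minimal sketch of that selection loop might look like the following. Everything here is hypothetical: `generate_run` and `judge_pick_best` are stand-ins for whatever sampling and judging calls are actually used, which aren't public.

```python
# Hypothetical sketch of judge-style best-of-n selection: sample many
# reasoning runs, then ask a judge model which one looks most compelling.
from typing import Callable

def select_by_judged_reasoning(
    problem: str,
    generate_run: Callable[[str], str],                # samples one full reasoning trace
    judge_pick_best: Callable[[str, list[str]], int],  # returns index of the best run
    n_samples: int = 32,
) -> str:
    runs = [generate_run(problem) for _ in range(n_samples)]
    best_index = judge_pick_best(problem, runs)
    return runs[best_index]
```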
The FrontierMath answers are numerical-ish (“problems have large numerical answers or complex mathematical objects as solutions”), so you can just check which answer the model wrote most frequently.
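That's essentially self-consistency voting. A minimal sketch, assuming a hypothetical `extract_answer` parser that pulls the final answer out of each run and that answers are hashable so they can be compared for equality:

```python
# Hypothetical sketch of answer-frequency (self-consistency) voting:
# the answer the model wrote most often across runs wins.
from collections import Counter
from typing import Callable, Hashable

def majority_vote(runs: list[str], extract_answer: Callable[[str], Hashable]) -> Hashable:
    answers = [extract_answer(run) for run in runs]
    # Pick the most frequently written answer; ties are broken arbitrarily.
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```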
Yeah, I agree that that could work. I (weakly) conjecture that they would get better results by doing something more like the thing I described, though.