Yes, that answer matches my understanding of the concern. If the vast majority of the dataset were private to Epoch, OpenAI could occasionally submit their solutions (probably via API) to Epoch for grading, but wouldn't be able to use the dataset at high frequency as an evaluation across many experiments.
This assumes that companies won't fish the data out of API logs anyway, which the OP asserts but which I think is unclear.
Also, if they have access to the mathematicians' reasoning in addition to the final answers, this could potentially be valuable without directly training on it (e.g. maybe they could use it to evaluate process-based grading approaches).
(FWIW I'm explaining the negatives, but I disagree with the comment I'm expanding on regarding the sign of FrontierMath; it seems positive EV to me despite the concerns.)
I'm guessing you view having a better understanding of what's coming as very high value, enough that burning some runway is acceptable? I could see that model (though I put <15% on it), but I think it is at least not good integrity-wise to have put on the appearance of doing only the part that is good for x-risk, and of not sharing it as an optimizable benchmark, while being funded by, and giving the data to, people who will use it for capability advancements.
Wanted to write a more thoughtful reply to this, but basically yes, my best guess is that the benefits of informing the world are in expectation bigger than the negatives from acceleration. A potentially important background view is that I think takeoff speeds matter more than timelines, and it's unclear to me how having FrontierMath affects takeoff speeds.
I wasn't thinking much about the optics, but I'd guess that's not a large effect. I agree that Epoch made a mistake here, though, and that this is a negative.
I could imagine changing my mind somewhat easily.
Agree that takeoff speeds are more important, and I expect that FrontierMath has much less effect on takeoff speed. I still think timelines matter enough that the amount of relevant information you buy from this is likely not worth the cost, especially if the org avoids talking about risks in public and leadership isn't focused on agentic takeover, so the info isn't packaged with the context needed for it to have the effects that would actually help.