Not Tamay, but from elliotglazer on Reddit[1] (14h ago): “Epoch’s lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven’t yet independently verified their 25% claim. To do so, we’re currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.
My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can’t vouch for them until our independent evaluation is complete.”
“Currently developing a hold-out dataset” gives a different impression than
“We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities” and “they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.”
Emphasis mine. He also mentions “the holdout set we are developing” on Twitter.