On Twitter on Dec 20th, Tamay said the holdout set was independently funded. Today’s blog post (https://epoch.ai/blog/openai-and-frontiermath) says OpenAI still owns the holdout set problems (and that OpenAI has access to the questions but not the solutions).
The post also clarifies that the holdout set (50 problems) is not yet complete.
The blog post says Epoch requested permission ahead of the benchmark announcement (Nov 7th) and got it ahead of the o3 announcement (Dec 20th). From my own look at the timings, the arXiv paper was updated about 7 hours and 8 minutes before the o3 stream on YouTube, on the same date, so it was technically ahead of the o3 announcement. I was wrong to say that the arXiv version came only after the o3 announcement; that was just a guess based on the dates matching. I could have known better and checked the clock time.[1]
Nat McAleese tweeted that “we [OpenAI] did not use FrontierMath data to guide the development of o1 or o3, at all.” This is nice. If true, then I was wrong and misleading to spread the rumor that they use it for validation. (Validation can mean different things, including using the benchmark to hill-climb on capabilities.)
Nat McAleese also says that “hard uncontaminated benchmarks are incredibly valuable”. I don’t really know what the mechanisms for their being valuable are; OpenAI surely knows better than I can, but I would appreciate discussion here. I would have thought the value is mostly a combination of marketing plus guiding the development of current or future models (in various ways, with outright training on the dataset being a lower bound on its usefulness for capabilities), and that the marketing value wouldn’t really differ depending on whether you own the dataset. That’s why I expect this dataset to be used to guide the development of future models. I would love to learn more here.[2]
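To make the distinction above concrete, here is a minimal sketch of how a benchmark can guide development without any model ever training on it, simply through model selection (“hill-climbing”). All names here are hypothetical illustrations, not a description of anyone’s actual process:

```python
from typing import Callable, Sequence

# A "model" here is just a function from a problem statement to an answer.
Model = Callable[[str], str]

def evaluate(model: Model, problems: Sequence[tuple[str, str]]) -> float:
    """Fraction of (problem, answer) pairs the model gets right."""
    correct = sum(model(q) == a for q, a in problems)
    return correct / len(problems)

def select_best(candidates: Sequence[Model],
                benchmark: Sequence[tuple[str, str]]) -> Model:
    """Pick the checkpoint/configuration with the best benchmark score.

    No candidate ever trains on the benchmark problems, but repeating
    this selection step steers development toward the benchmark anyway.
    """
    return max(candidates, key=lambda m: evaluate(m, benchmark))

# Toy usage: the selection itself is where the information leaks.
bench = [("1+1", "2"), ("2+2", "4")]
model_a: Model = lambda q: "2"
model_b: Model = lambda q: {"1+1": "2", "2+2": "4"}.get(q, "")
best = select_best([model_a, model_b], bench)  # picks model_b
```

This is why “we did not train on it” and “it did not guide development at all” are different claims.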
Something that I haven’t really seen mentioned: a lot of people are willing to work for less compensation for a (mostly OpenPhil-funded) non-profit than they would for a project commissioned by OpenAI. This is another angle that makes me sad about the non-transparency, though it’s relatively minor in my view.
I initially wrote that the blog post did not discuss the Tier 4 problems, but I was wrong: it says the Tier 4 problems will be owned by OpenAI as well.
I’m somewhat disappointed that they did not address this benchmark having been marketed as secure and “eliminating data contamination concerns”. I think this marketing[3] means that their statement that “we have not communicated clearly enough about the relationship between FrontierMath and OpenAI”[4] understates the problem.[5]
Tamay’s tweet thanking OpenAI for their support was also from Dec 20th (I didn’t check the clock time), and I don’t know when they added the acknowledgment to their website. The tweet says OpenAI “recently” provided permission to publicly share the support.

[2] Please come up with incredibly valuable uses of hard uncontaminated datasets that don’t guide development at all.
[3] Well, that and the agreement with OpenAI including not disclosing the relationship.
[4] As the main issue. (Source: the blog post linked above.)
[5] I’m uncertain about mentioning this last point, and here I’m most likely to go wrong.
In the post:

“Our agreement did not prevent us from disclosing to our contributors that this work was sponsored by an AI company. Many contributors were unaware of these details, and our communication with them should have been more systematic and transparent.”
So… why did they not disclose to their contributors that this work was sponsored by an AI company?
Specifically, not just any AI company, but the AI company that has (deservedly) perhaps the worst rep among all the frontier AI companies.[1]
I can’t help but think that some of the contributors would decline the offer to contribute had they been told that it was sponsored by an AI capabilities company.
I can’t help but think that many more would decline the offer had they been told that it was sponsored by OpenAI specifically.
I can’t help but think that this is the reason why they were not informed.

[1] Though Meta also has a legitimate claim to having the worst rep, albeit with different axes of worseness contributing to their overall score.
“some of the contributors would decline the offer to contribute had they been told that it was sponsored by an AI capabilities company.”
This is definitely true. There were ~100 mathematicians working on this (we don’t know how many of them knew) and there’s this.
I interpret you as insinuating that not disclosing that it was a project commissioned by industry was strategic. It might not have been, or perhaps it was only to a lesser extent than one might think.
I’d guess not everyone involved was modeling how the mathematicians would feel. Epoch AI employs multiple people (around 20?), and multiple people there worked on this project. Maybe the person or people communicating with the mathematicians were not in the meetings with OpenAI, or weren’t actively thinking about the details or implications of the agreement, since their job was to recruit people; and in turn the people who did think about the full details missed communicating them to the mathematicians. Something like that is a possibility; coordination is hard.
“I interpret you as insinuating that not disclosing that it was a project commissioned by industry was strategic.”
I’m not necessarily implying that they explicitly/deliberately coordinated on this.
Perhaps there was no explicit “don’t mention OpenAI” policy, but there was no “person X is responsible for ensuring that mathematicians know about OpenAI’s involvement” policy either.
But given that some of the mathematicians hadn’t heard a word about OpenAI’s involvement from the Epoch team, it seems like Epoch at least had a reason not to mention OpenAI’s involvement (though this depends on how extensive the communication between the two sides was). Possibly because they were aware of how the mathematicians might react, both before the project started and in the middle of it.
[ETA: In short, I would have expected this information to reach the mathematicians with high probability, unless the Epoch team had been disinclined to inform the mathematicians.]
Obviously, I’m just speculating here, and the non-Epoch mathematicians involved in the creation of FrontierMath know better than whatever I might speculate.
“we [OpenAI] did not use FrontierMath data to guide the development of o1 or o3, at all.”
My personal view is that there was actually very little time between whenever OpenAI received the dataset (its creation started around September; the paper came out Nov 7th) and when o3 was announced, so it makes sense that that version of o3 wasn’t guided by FrontierMath at all.