Some of the contributors would have declined the offer to contribute had they been told that it was sponsored by an AI capabilities company.
This is definitely true. There were ~100 mathematicians working on this (we don’t know how many of them knew) and there’s this.
I interpret you as insinuating that the decision not to disclose that this was a project commissioned by industry was strategic. It might not have been, or maybe it was to some extent, but not as much as one might think.
I’d guess not everyone involved was modeling how the mathematicians would feel. There are multiple (around 20?) people employed at Epoch AI, and multiple people at Epoch AI working on this project. Maybe the person or people communicating with the mathematicians were not in the meetings with OpenAI, or weren’t actively thinking about the details or implications of the agreement when their job was to recruit people, and in turn the people who did think through the full details missed communicating them to the mathematicians. Something like that is a possibility; coordination is hard.
https://epoch.ai/blog/openai-and-frontiermath
On Twitter, on Dec 20th, Tamay said the holdout set was independently funded. This blog post from today says OpenAI still owns the holdout set problems (and that OpenAI has access to the questions but not the solutions).
The post also clarifies that the holdout set (50 problems) is not yet complete.
The blog post says Epoch requested permission ahead of the benchmark announcement (Nov 7th), and got it ahead of the o3 announcement (Dec 20th). From my look at the timings, the Arxiv paper was updated about 7 hours and 8 minutes before the o3 stream on YouTube, on the same date, so technically ahead of the o3 announcement. I was wrong to say that the Arxiv version came only after the o3 announcement; that was only a guess based on the date being the same. I could have known better and checked the clock time.[1]
Nat McAleese tweeted that “we [OpenAI] did not use FrontierMath data to guide the development of o1 or o3, at all.” This is nice. If true, then it was wrong and misleading of me to spread a rumor that they use it for validation. Validation can mean different things, including using the benchmark to hill-climb on capabilities.
Nat McAleese also says that “hard uncontaminated benchmarks are incredibly valuable”. I don’t really know what the mechanisms for them being valuable are. I know that OpenAI understands this better than I can, but I would appreciate discussion here. I would have thought that the value is mostly a combination of marketing and guiding the development of current or future models (in various ways, with training on the dataset being a lower bound for capability usefulness), and that the marketing value wouldn’t really differ depending on whether you own the dataset. That’s why I expect this dataset to be used in guiding the development of future models. I would love to learn more here.[2]
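To illustrate what I mean by hill-climbing: even without ever training on the problems, benchmark scores can be used to choose between candidate models or training recipes, which still guides development. A toy sketch of that selection loop (purely hypothetical; the function and checkpoint names are placeholders, and this is not a claim about what OpenAI actually does):

```python
# Toy illustration of "hill-climbing" on a held-out benchmark: candidate
# models are compared by their benchmark score and the best one is kept,
# so the benchmark steers development even though no model is trained on
# the problems themselves. All names here are hypothetical placeholders.
from typing import Callable

def pick_best_candidate(
    candidates: list[str],
    evaluate_on_benchmark: Callable[[str], float],
) -> str:
    """Return the candidate (e.g. a checkpoint or training recipe) with the top score."""
    scores = {c: evaluate_on_benchmark(c) for c in candidates}
    return max(scores, key=scores.get)

# Hypothetical usage: hard-coded scores standing in for a private eval harness.
fake_scores = {"checkpoint-a": 0.12, "checkpoint-b": 0.25, "checkpoint-c": 0.19}
best = pick_best_candidate(list(fake_scores), fake_scores.get)
print(best)  # -> "checkpoint-b": the benchmark has guided which model is developed further
```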
Something that I haven’t really seen mentioned is that a lot of people are willing to work for less compensation for a (mostly OpenPhil-funded) non-profit than for a project commissioned by OpenAI. This is another angle that makes me sad about the non-transparency, but it is relatively minor in my view.
The blog post did not discuss the Tier 4 problems. (Edit: I’m sorry, this was wrong; the post says the Tier 4 problems will be owned by OpenAI.)

I’m somewhat disappointed that they did not address this benchmark being marketed as secure and “eliminating data contamination concerns”. I think this marketing[3] means that their statement that “we have not communicated clearly enough about the relationship between FrontierMath and OpenAI”[4] understates the problem.[5]
- ^
Tamay’s tweet thanking OpenAI for their support was also on Dec 20th; I didn’t check the clock time. I don’t know when they added the acknowledgment to their website. The tweet says OpenAI “recently” provided permission to publicly share the support.
- ^
Please come up with incredibly valuable uses of hard uncontaminated datasets that don’t guide development at all.
- ^
Well, that and the agreement with OpenAI including not disclosing the relationship.
- ^
As the main issue. (source: the blog post linked above.)
- ^
I’m uncertain about mentioning this last point, and it is where I’m most likely to be wrong.
Not Tamay, but from elliotglazer on Reddit[1] (14h ago): “Epoch’s lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven’t yet independently verified their 25% claim. To do so, we’re currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.
My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can’t vouch for them until our independent evaluation is complete.”
“Currently developing a hold-out dataset” gives a different impression than
“We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities” and “they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.”
That was a quote from a commenter on Hacker News, not my view. I referenced the comment as something I thought captured a lot of people’s impression pre-Dec 20th. You may be right that most people didn’t have the impression that it’s unlikely, or that they didn’t have a reason to think so. I don’t really know.
Thanks, I’ll put the quote in italics so it’s clearer.
FrontierMath was funded by OpenAI.[1]
The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.
Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark; the previous Arxiv versions v1–v4 do not acknowledge OpenAI for their support. The support was made public on Dec 20th.[1]
Because the Arxiv version mentioning the OpenAI contribution came out right after the o3 announcement, I’d guess Epoch AI had some agreement with OpenAI not to mention it publicly until then.
The mathematicians creating the problems for FrontierMath were not (actively)[2] informed about the funding from OpenAI. The contractors were instructed to keep the exercises and their solutions secure, including not using Overleaf or Colab or emailing about the problems, and signing NDAs, “to ensure the questions remain confidential” and to avoid leakage. The contractors were also not informed about the OpenAI funding on December 20th. I believe there were named authors of the paper who had no idea about the OpenAI funding.
I believe the impression for most people, and for most contractors, was “This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem.”[3]
Neither Epoch AI nor OpenAI says publicly that OpenAI has access to the exercises, answers, or solutions. I have heard second-hand that OpenAI does have access to the exercises and answers, and that they use them for validation. I am not aware of any agreement between Epoch AI and OpenAI that prohibits using this dataset for training if they wanted to, and I have slight evidence against such an agreement existing.
In my view Epoch AI should have disclosed the OpenAI funding, and contractors should have had transparent information about the potential for their work to be used for capabilities when choosing whether to work on a benchmark.
- ^
Arxiv v5 (Dec 20th version) “We gratefully acknowledge OpenAI for their support in creating the benchmark.”
- ^
I do not know whether they disclosed it in response to neutral questions about who is funding this.
- ^
This is from a comment by a non-Epoch AI person on HackerNews from two months ago. Another example: Ars Technica wrote in a November news article that “FrontierMath’s difficult questions remain unpublished so that AI companies can’t train against it.”
meemi’s Shortform
We meet every Tuesday in Oakland at 6:15
I want to make sure: is this meeting still on Wednesday the 15th? Thank you. :) And thanks for organizing.
I think this is a great project. I believe your documentary would have high impact by informing and inspiring AI policy discussions. You’ve already interviewed an impressive number of relevant people. I admire your initiative in taking on this project quickly, even before getting funding for it.
Testing for parallel reasoning in LLMs
Great post! I’m glad you did this experiment.
I’ve worked on experiments testing gpt-3.5-turbo-0125’s performance at computing iterates of a given permutation function in one forward pass. Previously, my prompts placed some of the task instructions after the specification of the function. After reading your post, I altered my prompts so that all of the instructions were given before the problem instance. As in your experiments, this noticeably improved performance, replicating your result that performance is better if instructions are given before the instance of the problem.
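Roughly, the restructured setup looks like this (an illustrative sketch rather than my exact prompts or code; the prompt wording, helper names, and example parameters are my own placeholders):

```python
import random
from openai import OpenAI  # assumes the openai Python package and an API key are available

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_permutation_prompt(perm: dict[int, int], start: int, n_iters: int) -> str:
    """Build a prompt with ALL instructions given before the problem instance."""
    instructions = (
        "You will be given a permutation f on {0, ..., 9} as a list of mappings, "
        "a starting element x, and a number k. Compute f applied to x, k times "
        "(i.e. the k-th iterate of f starting from x). Answer with a single digit "
        "and nothing else."
    )
    instance = (
        "f: " + ", ".join(f"{a} -> {b}" for a, b in sorted(perm.items()))
        + f"\nx = {start}\nk = {n_iters}"
    )
    return instructions + "\n\n" + instance  # instructions first, problem instance last

def query(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Example: one random permutation of {0, ..., 9}, checked against the true iterate.
elems = list(range(10))
perm = dict(zip(elems, random.sample(elems, len(elems))))
x, k = 3, 4
truth = x
for _ in range(k):
    truth = perm[truth]
print("model:", query(make_permutation_prompt(perm, x, k)), "| expected:", truth)
```

The earlier variant would just place some of that instruction text after the f/x/k block instead.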
For those skeptical about
My personal view is that there was actually very little time between when OpenAI received the dataset (creation started around September; the paper came out Nov 7th) and when o3 was announced, so it makes sense that that version of o3 wasn’t guided at all by FrontierMath.