On o3: for what feels like the twentieth time this year, I see people freaking out, saying AGI is upon us, it’s the end of knowledge work, timelines now clearly in single-digit years, etc, etc. I basically don’t buy it, my low-confidence median guess is that o3 is massively overhyped. Major reasons:
I’ve personally done 5 problems from GPQA in different fields and got 4 of them correct (allowing internet access, which was the intent behind that benchmark). I’ve also seen one or two problems from the software engineering benchmark. In both cases, when I look at the actual problems in the benchmark, they are easy, despite people constantly calling them hard and saying that they require expert-level knowledge.
For GPQA, my median guess is that the PhDs they tested on were mostly pretty stupid. Probably a bunch of them were e.g. bio PhD students at NYU who would just reflexively give up if faced with even a relatively simple stat mech question which can be solved with a couple minutes of googling jargon and blindly plugging two numbers into an equation.
For software engineering, the problems are generated from real git pull requests IIUC, and it turns out that lots of those are things like e.g. “just remove this if-block”.
Generalizing the lesson here: the supposedly-hard benchmarks for which I have seen a few problems (e.g. GPQA, software eng) turn out to be mostly quite easy, so my prior on other supposedly-hard benchmarks which I haven’t checked (e.g. FrontierMath) is that they’re also mostly much easier than they’re hyped up to be.
On my current model of Sam Altman, he’s currently very desperate to make it look like there’s no impending AI winter, capabilities are still progressing rapidly, etc. Whether or not it’s intentional on Sam Altman’s part, OpenAI acts accordingly, releasing lots of very over-hyped demos. So, I discount anything hyped out of OpenAI, and doubly so for products which aren’t released publicly (yet).
Over and over again in the past year or so, people have said that some new model is a total game changer for math/coding, and then David will hand it one of the actual math or coding problems we’re working on and it will spit out complete trash. And not like “we underspecified the problem” trash, or “subtle corner case” trash. I mean like “midway through the proof it redefined this variable as a totally different thing and then carried on as though both definitions applied”. The most recent model with which this happened was o1.
Of course I am also tracking the possibility that this is a skill issue on our part, and if that’s the case I would certainly love for someone to help us do better. See this thread for a couple examples of relevant coding tasks.
My median-but-low-confidence guess here is that basically-all the people who find current LLMs to be a massive productivity boost for coding are coding things which are either simple, or complex only in standardized ways—e.g. most web or mobile apps. That’s the sort of coding which mostly involves piping things between different APIs and applying standard patterns, which is where LLMs shine.
I just spent some time doing GPQA, and I think I agree with you that the difficulty of those problems is overrated. I plan to write up more on this.
@johnswentworth Do you agree with me that modern LLMs probably outperform (you with internet access and 30 minutes) on GPQA diamond? I personally think this somewhat contradicts the narrative of your comment if so.
I don’t know, I have not specifically tried GPQA diamond problems. I’ll reply again if and when I do.
I at least attempted to be filtering the problems I gave you for GPQA diamond, although I am not very confident that I succeeded.
(Update: yes, the problems John did were GPQA diamond. I gave 5 problems to a group of 8 people, and gave them two hours to complete however many they thought they could complete without getting any wrong)
@Buck Apparently the five problems I tried were GPQA diamond, they did not take anywhere near 30 minutes on average (more like 10 IIRC?), and I got 4⁄5 correct. So no, I do not think that modern LLMs probably outperform (me with internet access and 30 minutes).
Ok, so it sounds like given 15-25 mins per problem (and maybe with just 10 mins per problem), you get 80% correct. This is worse than o3, which scores 87.7%. Maybe you’d do better on a larger sample: perhaps you got unlucky (extremely plausible given the small sample size; see the sample-size sketch after this comment) or the extra bit of time would help (though it sounds like you tried to use more time here and that didn’t help). Fwiw, my guess from the topics of those questions is that you actually got easier questions than average from that set.
I continue to think these LLMs will probably outperform (you with 30 mins). Unfortunately, the measurement is quite expensive, so I’m sympathetic to you not wanting to get to ground here. If you believe that you can beat them given just 5-10 minutes, that would be easier to measure. I’m very happy to bet here.
I think that even if it turns out you’re a bit better than LLMs at this task, we should note that it’s pretty impressive that they’re competitive with you given 30 minutes!
So I still think your original post is pretty misleading [ETA: with respect to how it claims GPQA is really easy].
I think the models would beat you by more at FrontierMath.
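For concreteness on the sample-size point above, here is a minimal illustrative sketch (assuming Python with scipy available) of how little a 5-problem sample pins down true accuracy:

```python
from scipy.stats import binomtest

# 4 correct out of 5 GPQA Diamond problems.
result = binomtest(k=4, n=5)
ci = result.proportion_ci(confidence_level=0.95)  # Clopper-Pearson exact interval

print(f"Observed accuracy: {4 / 5:.0%}")
print(f"95% CI for the true accuracy: [{ci.low:.2f}, {ci.high:.2f}]")
# Prints roughly [0.28, 0.99]: a 5-problem sample cannot statistically
# distinguish this performance from o3's reported 87.7%.
```

So the 80% vs. 87.7% comparison is well within noise for a sample this small.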
Even assuming you’re correct here, I don’t see how that would make my original post pretty misleading?
I think that how you talk about the questions being “easy”, and the associated stuff about how you think the baseline human measurements are weak, is somewhat inconsistent with you being worse than the model.
I mean, there are lots of easy benchmarks on which I can solve the large majority of the problems, and a language model can also solve the large majority of the problems, and the language model can often have a somewhat lower error rate than me if it’s been optimized for that. Seems like GPQA (and GPQA diamond) are yet another example of such a benchmark.
(my guess is you took more like 15-25 minutes per question? Hard to tell from my notes, you may have finished early but I don’t recall it being crazy early)
I remember finishing early, and then spending a lot of time going back over all of them a second time, because the goal of the workshop was to answer correctly with very high confidence. I don’t think I updated any answers as a result of the second pass, though I don’t remember very well.
(This seems like more time than Buck was taking – the goal was to not get any wrong so it wasn’t like people were trying to crank through them in 7 minutes)
The problems I gave were (as listed in the csv for the diamond problems):
#1 (Physics) (1 person got right, 3 got wrong, 1 didn’t answer)
#2 (Organic Chemistry) (John got right, I think 3 people didn’t finish)
#4 (Electromagnetism) (John and one other got right, 2 got wrong)
#8 (Genetics) (3 got right including John)
#10 (Astrophysics) (5 people got right)
@johnswentworth FWIW, GPQA Diamond seems much harder than GPQA main to me, and current models perform well on it. I suspect these models beat your performance on GPQA diamond if you’re allowed 30 mins per problem. I wouldn’t be shocked if you beat them (maybe I’m like 20%?), but that’s because you’re unusually broadly knowledgeable about science, not just because you’re smart.
I personally get wrecked by GPQA chemistry, get ~50% on GPQA biology if I have like 7 minutes per problem (which is notably better than their experts from other fields get, with much less time), and get like ~80% on GPQA physics with less than 5 minutes per problem. But GPQA Diamond seems much harder.
Is this with internet access for you?
Yes, I’d be way worse off without internet access.
Generalizing the lesson here: the supposedly-hard benchmarks for which I have seen a few problems (e.g. GPQA, software eng) turn out to be mostly quite easy, so my prior on other supposedly-hard benchmarks which I haven’t checked (e.g. FrontierMath) is that they’re also mostly much easier than they’re hyped up to be.

Daniel Litt’s account here supports this prejudice. As a math professor, he knew instantly how to solve the low/medium-level problems he looked at, and he suggests that each “high”-rated problem would be likewise instantly solvable by an expert in that problem’s subfield.
And since LLMs have eaten ~all of the internet, they essentially have the crystallized-intelligence skills for all (sub)fields of mathematics (and human knowledge in general). So from their perspective, all of those problems are very “shallow”. No human shares their breadth of knowledge, so math professors specialized even in slightly different subfields would indeed have to do a lot of genuine “deep” cognitive work; this is not the case for LLMs.
GPQA stuff is even worse, a literal advanced trivia quiz that seems moderately resistant to literal humans literally googling things, but not to the way the knowledge gets distilled into LLMs.
Basically, I don’t think any extant benchmark (except I guess the Millennium Prize Eval) actually tests “deep” problem-solving skills, in a way LLMs can’t cheat at using their overwhelming knowledge breadth.
My current strong-opinion-weakly-held is that they’re essentially just extensive knowledge databases with a nifty natural-language interface on top.[1] All of the amazing things they do should be considered surprising facts about how far this trick can scale; not surprising facts about how close we are to AGI.
Which is to say: this is the central way to characterize what they are; not merely “isomorphic to a knowledge database with a natural-language search engine on top if you think about them in a really convoluted way”. Obviously a human can also be considered isomorphic to database search if you think about it in a really convoluted way, but that wouldn’t be the most-accurate way to describe a human.
[...] he suggests that each “high”-rated problem would be likewise instantly solvable by an expert in that problem’s subfield.

This is an exaggeration and, as stated, false.
Epoch AI made 5 problems from the benchmark public. One of those was ranked “High”, and that problem was authored by me.
It took me 20-30 hours to create that submission. (To be clear, I considered variations of the problem, ran into some dead ends, spent a lot of time carefully checking my answer was right, wrote up my solution, thought about guess-proof-ness[1] etc., which ate up a lot of time.)
I would call myself an “expert in that problem’s subfield” (e.g. I have authored multiple related papers).
I think you’d be very hard-pressed to find any human who could deliver the correct answer to you within 2 hours of seeing the problem.
E.g. I think it’s highly likely that I couldn’t have done that (I think it’d have taken me more like 5 hours), I’d be surprised if my colleagues in the relevant subfield could do that, and I think the problem is specialized enough that few of the top people in CodeForces or Project Euler could do it.
On the other hand, I don’t think the problem is very hard insight-wise—I think it’s pretty routine, but requires care with details and implementation. There are certainly experts who can see the right main ideas quickly (including me). So there’s something to the point of even FrontierMath problems being surprisingly “shallow”. And as is pointed out in the FM paper, the benchmark is limited to relatively short-scale problems (hours to days for experts) - which really is shallow, as far as the field of mathematics is concerned.
But it’s still an exaggeration to talk about “instantly solvable”. Of course, there’s no escaping Engel’s maxim “A problem changes from impossible to trivial if a related problem was solved in training”—I guess the problem is instantly solvable to me now… but if you are hard-pressed to find humans that could solve it “instantly” when seeing it the first time, then I wouldn’t describe it in those terms.
Also, there are problems in the benchmark that require more insight than this one.
Daniel Litt writes about the problem: “This one (rated “high”) is a bit trickier but with no thinking at all (just explaining what computation I needed GPT-4o to do) I got the first 3 digits of the answer right (the answer requires six digits, and the in-window python timed out before it could get this far)
Of course *proving* the answer to this one is correct is harder! But I do wonder how many of these problems are accessible to simulation/heuristics. Still an immensely useful tool but IMO people should take a step back before claiming mathematicians will soon be replaced”.
I very much considered naive simulations and heuristics. The problem is getting 6 digits right, not 3. (The AIs are given a limited compute budget.) This is not valid evidence in favor of the problem’s easiness or for the benchmark’s accessibility to simulation/heuristics—indeed, this is evidence in the opposing direction.
See also Evan Chen’s “I saw the organizers were pretty ruthless about rejecting problems for which they felt it was possible to guess the answer with engineer’s induction.”
Thanks, that’s important context!
And fair enough, I used excessively sloppy language. By “instantly solvable”, I did in fact mean “an expert would very quickly (“instantly”) see the correct high-level approach to solving it, with the remaining work being potentially fiddly, but conceptually straightforward”. “Instantly solvable” in the sense of “instantly know how to solve”/”instantly reducible to something that’s trivial to solve”.[1]
Which was based on this quote of Litt’s:

FWIW the “medium” and “low” problems I say I immediately knew how to do are very close to things I’ve thought about; the “high”-rated problem above is a bit further, and I suspect an expert closer to it would similarly “instantly” know the answer.
That said,

if you are hard-pressed to find humans that could solve it “instantly” when seeing it the first time, then I wouldn’t describe it in those terms
If there are no humans who can “solve it instantly” (in the above sense), then yes, I wouldn’t call it “shallow”. But if such people do exist (even if they’re incredibly rare), this implies that the conceptual machinery (in the form of theorems or ansatzes) for translating the problem into a trivial one already exists as well. Which, in turn, means it’s likely present in the LLM’s training data. And therefore, from the LLM’s perspective, that problem is trivial to translate into a conceptually trivial problem.
It seems you’d largely agree with that characterization?
Note that I’m not arguing that LLMs aren’t useful or unimpressive-in-every-sense. This is mainly an attempt to build a model of why LLMs seem to perform so well on apparently challenging benchmarks while reportedly falling flat on their faces on much simpler real-life problems.
Or, closer to the way I natively think of it: In the sense that there are people (or small teams of people) with crystallized-intelligence skillsets such that they would be able to solve this problem by plugging their crystallized-intelligence skills one into another, without engaging in prolonged fluid-intelligence problem-solving.
This looks reasonable to me.
Yes. My only hesitation is about how real-life-important it is for AIs to be able to do math for which very-little-to-no training data exists. The internet and the mathematical literature are so vast that, unless you are doing something truly novel, there’s some relevant subfield there—in which case FrontierMath-style benchmarks would be informative of capability to do real math research.
Also, re-reading Wentworth’s original comment, I note that o1 is weak according to FM. Maybe the things Wentworth is doing are just too hard for o1, rather than (just) overfitting-on-benchmarks style issues? In any case his frustration with o1’s math skills doesn’t mean that FM isn’t measuring real math research capability.
Previously, I’d intuitively assumed the same as well: that it doesn’t matter if LLMs can’t “genuinely research/innovate”, because there is enough potential for innovative-yet-trivial combination of existing ideas that they’d still massively speed up R&D by finding those combinations. (“Innovation overhang”, as @Nathan Helm-Burger puts it here.)
Back in early 2023, I’d considered it fairly plausible that the world would start heating up in 1-2 years due to such synthetically-generated innovations.
Except this… just doesn’t seem to be happening? I’ve yet to hear of a single useful scientific paper or other meaningful innovation that was spearheaded by an LLM.[1] And they’re already adept at comprehending such innovative-yet-trivial combinations if a human prompts them with those combinations. So it’s not a matter of not yet being able to understand or appreciate the importance of such synergies. (If Sonnet 3.5.1 or o1 pro didn’t do it, I doubt o3 would.)
Yet this is still not happening. My guess is that “innovative-yet-trivial combinations of existing ideas” are not actually “trivial”, and LLMs can’t do that for the same reasons they can’t do “genuine research” (whatever those reasons are).
Admittedly it’s possible that this is totally happening all over the place and people are just covering it up in order to have all of the glory/status for themselves. But I doubt it: there are enough remarkably selfless LLM enthusiasts that if this were happening, I’d expect it would’ve gone viral already.
There are 2 things to keep in mind:
1. It’s only now that LLMs are reasonably competent in at least some hard problems, and at any rate, I expect RL to basically solve the domain, because of verifiability properties combined with quite a bit of training data.
2. We should wait a few years, as we have another scale-up that’s coming up, and it will probably be quite a jump from current AI due to more compute:
https://www.lesswrong.com/posts/NXTkEiaLA4JdS5vSZ/?commentId=7KSdmzK3hgcxkzmPX
It’s only now that LLMs are reasonably competent in at least some hard problems

I don’t think that’s the limiter here. Reports in the style of “my unpublished PhD thesis was about doing X using Y methodology, I asked an LLM to do that and it one-shot a year of my work! the equations it derived are correct!” have been around for quite a while. I recall it at least in relation to Claude 3, and more recently, o1-preview.
If LLMs are prompted to combine two ideas, they’ve been perfectly capable of “innovating” for ages now, including at fairly high levels of expertise. I’m sure there’s some sort of cross-disciplinary GPQA-like benchmark that they’ve saturated a while ago, so this is even legible.
The trick is picking which ideas to combine/in what direction to dig. This doesn’t appear to be something LLMs are capable of doing well on their own, nor do they seem to speed up human performance on this task. (All cases of them succeeding at it so far have been, by definition, “searching under the streetlight”: checking whether they can appreciate a new idea that a human already found on their own and evaluated as useful.)
I suppose it’s possible that o3 or its successors change that (the previous benchmarks weren’t measuring that, but surely FrontierMath does...). We’ll see.
I expect RL to basically solve the domain

Mm, I think it’s still up in the air whether even the o-series efficiently scales (as in, without requiring a Dyson Swarm’s worth of compute) to beating the Millennium Prize Eval (or some less legendary yet still major problems).
I expect such problems don’t pass the “can this problem be solved by plugging the extant crystallized-intelligence skills of a number of people into each other in a non-contrived[1] way?” test. Does RL training allow models to sidestep this, letting them generate new crystallized-intelligence skills?
I’m not confident one way or another.
we have another scale-up that’s coming up

I’m bearish on that. I expect GPT-4 to GPT-5 to be palpably less of a jump than GPT-3 to GPT-4, the same way GPT-3 to GPT-4 was less of a jump than GPT-2 to GPT-3. I’m sure it’d show lower loss, and saturate some more benchmarks, and perhaps an o-series model based on it clears FrontierMath, and perhaps programmers and mathematicians would be able to use it in an ever-so-bigger number of cases...
But I predict, with low-moderate confidence, that it still won’t kick off a deluge of synthetically derived innovations. It’d have even more breadth and eye for nuance, but somehow, perplexingly, still no ability to use those capabilities autonomously.
“Non-contrived” because technically, any cognitive skill is just a combination of e. g. NAND gates, since those are Turing-complete. But obviously that doesn’t mean any such skill is accessible if you’ve learned the NAND gate. Intuitively, a combination of crystallized-intelligence skills is only accessible if the idea of combining them is itself a crystallized-intelligence skill (e. g., in the math case, a known ansatz).
Which perhaps sheds some light on why LLMs can’t innovate even via trivial idea combinations. If a given idea-combination “template” isn’t present in the training data, the LLM can’t reliably conceive of it independently except by brute-force enumeration...? This doesn’t seem quite right, but it’s maybe in the right direction.
I think my key crux is that in domains where there is a way to verify that the solution actually works, RL can scale to superhuman performance. Mathematics and programming are domains that are unusually easy to verify and to gather training data for, so with caveats RL can become rather good at those specific domains and benchmarks (like the Millennium Prize Eval). The important caveat is that I don’t believe this transfers very well to domains where verifying isn’t easy, like creative writing.
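A minimal sketch of the verifiability point, in Python: for programming-style tasks the reward can be computed automatically by running checks, with no human grading. Everything here (generate_candidate, the test functions) is a hypothetical placeholder, not any particular lab’s setup.

```python
from typing import Callable, List, Tuple

def verified_reward(candidate: str, tests: List[Callable[[str], bool]]) -> float:
    """Binary reward: 1.0 if the candidate passes every automatic check, else 0.0."""
    return 1.0 if all(test(candidate) for test in tests) else 0.0

def collect_rl_batch(generate_candidate: Callable[[str], str],
                     prompt: str,
                     tests: List[Callable[[str], bool]],
                     n_samples: int = 16) -> List[Tuple[str, float]]:
    """Sample candidate solutions and score each one with the verifier.
    The scoring step needs no human labels, which is what makes math and
    programming unusually easy domains to generate RL training data for."""
    batch = []
    for _ in range(n_samples):
        candidate = generate_candidate(prompt)  # hypothetical model call
        batch.append((candidate, verified_reward(candidate, tests)))
    return batch  # (candidate, reward) pairs to feed into an RL update
```

Whether this kind of signal transfers to domains without such a verifier (the creative-writing caveat above) is the open question.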
I was talking about the 1 GW systems that would be developed in late 2026-early 2027, not GPT-5.
in domains where there is a way to verify that the solution actually works, RL can scale to superhuman performance

Sure, the theory on that is solid. But how efficiently does it scale off-distribution, in practice?
The inference-time scaling laws, much like the pretraining scaling laws, are ultimately based on test sets whose entries are “shallow” (in the previously discussed sense). They don’t tell us much regarding how well the technique scales with the “conceptual depth” of a problem.
o3 took a million dollars in inference-time compute and unknown amounts in training-time compute just to solve the “easy” part of the FrontierMath benchmark (whose problems likely take human experts single-digit hours, maybe <1 hour for particularly skilled humans). How much would be needed for beating the “hard” subset of FrontierMath? How much more still would be needed for problems that take individual researchers days; or problems that take entire math departments months; or problems that take entire fields decades?
It’s possible that the “synthetic data flywheel” works so well that the amount of human-researcher-hour-equivalents per unit of compute scales, say, exponentially with some aspect of o-series’ training, and so o6 in 2027 solves the Riemann Hypothesis.
Or it scales not that well, and o6 can barely clear real-life equivalents of hard FrontierMath problems. Perhaps instead the training costs (generating all the CoT trees on which RL training is then done) scale exponentially, while researcher-hour-equivalents per unit of compute scale linearly.
It doesn’t seem to me that we know which one it is yet. Do we?
I don’t think we know yet whether it will succeed in practice, or whether its training costs make it infeasible to do.
Consider: https://www.cognitiverevolution.ai/can-ais-generate-novel-research-ideas-with-lead-author-chenglei-si/
I think a different phenomenon is occurring. My guess, updating on my own experience, is that ideas aren’t the current bottleneck. 1% inspiration, 99% perspiration.
As someone who has been reading 3-20 papers per month for many years now, in neuroscience and machine learning, I feel overwhelmed with ideas. I average about 0.75 per paper. I write them down, and the lists grow about two orders of magnitude faster than they shrink.
When I was on my favorite industry team, what I most valued about my technical manager was his ability to help me sort through and prioritize them. It was like I created a bunch of LEGO pieces, he picked one to be next, I put it in place by coding it up, he checked the placement by reviewing my PR. If someone had offered me a source of ideas ranging in quality from worse than my worst ideas to almost as good as my best ideas, and skewed towards bad… I’d have laughed and turned them down without a second thought.
For something like a paper, instead of a minor tech idea for a 1-week PR… the situation is far more intense. The grunt work of running the experiments and preparing the paper is enormous compared to the time and effort of coming up with the idea in the first place. More like 0.1% to 99.9%.
Current LLMs can speed up creating a paper if given the results and experiment description to write about. That’s probably also not the primary bottleneck (although still more than idea generation).
So the current bottleneck for ML experiments, in my estimation, is the experiments themselves: coding up the experiments accurately and efficiently, running them (and handling the compute costs), analyzing the results.
So I’ve been expecting to see an acceleration dependent on that aspect. That’s hard to measure though. Are LLMs currently speeding this work up a little? Probably. I’ve had my work sped up some by the recent Sonnet 3.5.1. Currently, though, it’s a trade-off: there’s overhead in checking for misinterpretations and correcting bugs. We still seem a long way in “capability space” from me being able to give a background paper and rough experiment description, and then having the model do the rest. Only once that’s the case will idea generation become my bottleneck.
That’s the opposite of my experience. Nearly all the papers I read vary between “trash, I got nothing useful out besides an idea for a post explaining the relevant failure modes” and “high quality but not relevant to anything important”. Setting up our experiments is historically much faster than the work of figuring out what experiments would actually be useful.
There are exceptions to this, large projects which seem useful and would require lots of experimental work, but they’re usually much lower-expected-value-per-unit-time than going back to the whiteboard, understanding things better, and doing a simpler experiment once we know what to test.
Ah, well, for most papers that spark an idea in me, the idea isn’t simply an extension of the paper. It’s a question tangentially related which probes at my own frontier of understanding.
I’ve always found that a boring lecture is a great opportunity to brainstorm because my mind squirms away from the boredom into invention and extrapolation of related ideas. A boring paper does some of the same for me, except that I’m less socially pressured to keep reading it, and thus less able to squeeze my mind with the boredom of it.
As for coming up with ideas… It is a weakness of mine that I am far better at generating ideas than at critiquing them (my own or others’). Which is why I worked so well in a team where I had someone I trusted to sort through my ideas and pick out the valuable ones. It sounds to me like you have a better filter on idea quality.
That’s mostly my experience as well: experiments are near-trivial to set up, and setting up any experiment that isn’t near-trivial to set up is a poor use of the time that can instead be spent thinking on the topic a bit more and realizing what the experimental outcome would be or why this would be entirely the wrong experiment to run.
But the friction costs of setting up an experiment aren’t zero. If it were possible to sort of ramble an idea at an AI and then have it competently execute the corresponding experiment (or set up a toy formal model and prove things about it), I think this would be able to speed up even deeply confused/non-paradigmatic research.
… That said, I think the sorts of experiments we do aren’t the sorts of experiments ML researchers do. I expect they’re often things like “do a pass over this lattice of hyperparameters and output the values that produce the best loss” (and more abstract equivalents of this that can’t be as easily automated using mundane code). And which, due to the atheoretic nature of ML, can’t be “solved in the abstract”.
So ML research perhaps could be dramatically sped up by menial-software-labor AIs. (Though I think even now the compute needed for running all of those experiments would be the more pressing bottleneck.)
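For concreteness, a minimal sketch of the kind of hyperparameter-lattice sweep described a couple of comments up; train_and_eval is a hypothetical stand-in for whatever experiment is actually being run:

```python
import itertools

def grid_search(train_and_eval, grid: dict):
    """Try every point on the hyperparameter lattice; return the lowest-loss config."""
    best_cfg, best_loss = None, float("inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        loss = train_and_eval(**cfg)  # hypothetical: runs one experiment, returns a loss
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

# Example usage (hypothetical search space):
# best, loss = grid_search(train_and_eval,
#                          {"lr": [1e-4, 3e-4, 1e-3], "batch_size": [32, 64]})
```

The point of the comments above is that this menial part (the loop, the plumbing around it, babysitting the runs) is automatable in principle, while deciding which lattice is worth sweeping is not.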
Convincing.
All of the amazing things they do should be considered surprising facts about how far this trick can scale; not surprising facts about how close we are to AGI.

I agree that the trick scaling as far as it has is surprising, but I’d disagree with the claim that this doesn’t bear on AGI.
I do think that something like dumb scaling can mostly just work, and I think the main takeaway I take from AI progress is that there will not be a clear resolution to when AGI happens, as the first AIs to automate AI research will have very different skill profiles from humans, and most importantly we need to disentangle capabilities in a way we usually don’t for humans.
I agree with faul sname here:

we should stop asking when we will get AGI and start asking about when we will see each of the phenomena that we are using AGI as a proxy for.
I do think that something like dumb scaling can mostly just work

The exact degree of “mostly” is load-bearing here. You’d mentioned provisions for error-correction before. But are the necessary provisions something simple, such that the most blatantly obvious wrappers/prompt-engineering works, or do we need to derive some additional nontrivial theoretical insights to correctly implement them?
Last I checked, AutoGPT-like stuff has mostly failed, so I’m inclined to think it’s closer to the latter.
Actually, I’ve changed my mind, in that the reliability issue probably does need at least non-trivial theoretical insights to make AIs work.
I am unconvinced that “the” reliability issue is a single issue that will be solved by a single insight, rather than AIs lacking procedural knowledge of how to handle a bunch of finicky special cases that will be solved by online learning or very long context windows once hardware costs decrease enough to make one of those approaches financially viable.
Yeah, I’m sympathetic to this argument that there won’t be a single insight, and that at least one approach will work out once hardware costs decrease enough, and I agree less with Thane Ruthenis’s intuitions here than I did before.
If I were to think about it a little, I’d suspect the big difference between LLMs and humans is state/memory: humans have it, but LLMs are currently more or less stateless, and RNN training has not been solved to the extent transformer training has.
One thing I will also say is that future AI winters will be shorter than previous ones, because AI products can now be made at least somewhat profitable, and this gives an independent base of money for AI research in ways that weren’t possible pre-2016.
A factor stemming from the same cause but pushing in the opposite direction is that “mundane” AI profitability can “distract” people who would otherwise be AGI hawks.
I agree with you on your assessment of GPQA. The questions themselves appear to be low quality as well. Take this one example, although it’s not from GPQA Diamond:

In UV/Vis spectroscopy, a chromophore which absorbs red colour light, emits _____ colour light.
The correct answer is stated as yellow and blue. However, the question should read transmits, not emits; molecules cannot trivially absorb and re-emit light of a shorter wavelength without resorting to trickery (nonlinear effects, two-photon absorption).
This is, of course, a cherry-picked example, but it is exactly characteristic of the sort of low-quality science questions I saw in school (e.g. with a teacher or professor who didn’t understand the material very well). Scrolling through the rest of the GPQA questions, they did not seem like questions that would require deep reflection or thinking, but rather the sort of trivia that I would expect LLMs to perform extremely well on.
I’d also expect “popular” benchmarks to be easier/worse/optimized for looking good while actually being relatively easy. OpenAI et al. probably have the mother of all publication biases with respect to benchmarks, and are selecting very heavily for items within this collection.
Personally, I think o1 is uniquely trash; I think o1-preview was actually better. I’m getting, on average, better results from DeepSeek and Sonnet 3.5 atm.