I at least attempted to be filtering the problems I gave you for GPQA diamond, although I am not very confident that I succeeded.
(Update: yes, the problems John did were GPQA diamond. I gave 5 problems to a group of 8 people, and gave them two hours to complete however many they thought they could complete without getting any wrong)
@Buck Apparently the five problems I tried were GPQA diamond, they did not take anywhere near 30 minutes on average (more like 10 IIRC?), and I got 4⁄5 correct. So no, I do not think that modern LLMs probably outperform (me with internet access and 30 minutes).
Ok, so sounds like given 15-25 mins per problem (and maybe with 10 mins per problem), you get 80% correct. This is worse than o3, which scores 87.7%. Maybe you’d do better on a larger sample: perhaps you got unlucky (extremely plausible given the small sample size) or the extra bit of time would help (though it sounds like you tried to use more time here and that didn’t help). Fwiw, my guess from the topics of those questions is that you actually got easier questions than average from that set.
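To spell out the small-sample point: assuming the five questions are independent and taking the 87.7% figure at face value, a quick sketch of how likely a 4/5-or-worse result is even for someone whose per-question accuracy matches o3's (the exact framing and numbers here are my own illustration, not a measurement):

```python
# Sketch under the assumptions above: per-question accuracy of 87.7%,
# five independent questions. How likely is getting at least one wrong?
p = 0.877  # assumed per-question accuracy (o3's reported GPQA diamond score)
n = 5      # number of questions attempted

p_at_most_4_correct = 1 - p ** n  # 1 - P(all five right)
print(f"P(4/5 or worse at {p:.1%} accuracy) ~= {p_at_most_4_correct:.2f}")  # ~0.48
```

So roughly a coin flip's chance of missing at least one question even at that accuracy, which is why five questions tells us little either way.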
I continue to think these LLMs will probably outperform (you with 30 mins). Unfortunately, the measurement is quite expensive, so I’m sympathetic to you not wanting to get to ground here. If you believe that you can beat them given just 5-10 minutes, that would be easier to measure. I’m very happy to bet here.
I think that even if it turns out you’re a bit better than LLMs at this task, we should note that it’s pretty impressive that they’re competitive with you given 30 minutes!
So I still think your original post is pretty misleading [ETA: with respect to how it claims GPQA is really easy].
I think the models would beat you by more at FrontierMath.
Even assuming you’re correct here, I don’t see how that would make my original post pretty misleading?
I think that how you talk about the questions being “easy”, and the associated stuff about how you think the baseline human measurements are weak, is somewhat inconsistent with you being worse than the model.
I mean, there are lots of easy benchmarks on which I can solve the large majority of the problems, and a language model can also solve the large majority of the problems, and the language model can often have a somewhat lower error rate than me if it’s been optimized for that. Seems like GPQA (and GPQA diamond) are yet another example of such a benchmark.
What do you mean by “easy” here?
(my guess is you took more like 15-25 minutes per question? Hard to tell from my notes, you may have finished early but I don’t recall it being crazy early)
I remember finishing early, and then spending a lot of time going back over all them a second time, because the goal of the workshop was to answer correctly with very high confidence. I don’t think I updated any answers as a result of the second pass, though I don’t remember very well.
(This seems like more time than Buck was taking – the goal was to not get any wrong so it wasn’t like people were trying to crank through them in 7 minutes)
The problems I gave were (as listed in the CSV for the diamond problems):
#1 (Physics) (1 person got right, 3 got wrong, 1 didn’t answer)
#2 (Organic Chemistry) (John got right, I think 3 people didn't finish)
#4 (Electromagnetism) (John and one other got right, 2 got wrong)
#8 (Genetics) (3 got right, including John)
#10 (Astrophysics) (5 people got right)