I think the way you talk about the questions being “easy”, and the associated claims that the baseline human measurements are weak, is somewhat inconsistent with you being worse than the model.
I mean, there are lots of easy benchmarks where I can solve the large majority of the problems, a language model can also solve the large majority of the problems, and the language model often has a somewhat lower error rate than me if it’s been optimized for that. GPQA (and GPQA Diamond) seem like yet another example of such a benchmark.
Even assuming you’re correct here, I don’t see how that would make my original post pretty misleading?
What do you mean by “easy” here?