I mean, there are lots of easy benchmarks on which I can solve the large majority of the problems, and a language model can also solve the large majority of the problems, and the language model can often have a somewhat lower error rate than me if it’s been optimized for that. Seems like GPQA (and GPQA Diamond) is yet another example of such a benchmark.
What do you mean by “easy” here?