Good post.
The craziest thing for me is that the results of different evals that are supposed to evaluate similar things, like ProtocolQA and my BioLP-bench, are highly inconsistent. For example, two models can have similar scores on ProtocolQA, yet one answers twice as many BioLP-bench questions correctly as the other. It means that we might not be measuring the things we think we measure. And no one knows what causes this difference in the results.
Do you feel that’s still an issue when you compare to the human expert baseline?
Human experts scored 79% on ProtocolQA, and o1-preview scored 81%
Human experts scored 38% on BioLP-bench, and o1-preview scored 36%
The fact that both human experts and o1-preview scored twice as high on ProtocolQA as on BioLP-bench doesn’t feel that inconsistent to me. It seems your BioLP-bench questions are just “twice as hard”. I’d find it more inconsistent if o1-preview matched human expert performance on one test but not the other (a quick normalization against the human baseline is sketched below).
(There are other open questions about whether the human experts in both studies are comparable and how much time people had)
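To make the comparison concrete, here is a minimal sketch that normalizes o1-preview’s score by the human-expert baseline on each benchmark, using only the figures quoted above:

```python
# Back-of-the-envelope check: o1-preview's score relative to the human baseline,
# using only the figures quoted above.
scores = {
    "ProtocolQA":  {"human": 0.79, "o1-preview": 0.81},
    "BioLP-bench": {"human": 0.38, "o1-preview": 0.36},
}

for bench, s in scores.items():
    print(f"{bench}: o1-preview at {s['o1-preview'] / s['human']:.2f}x the human baseline")

# Roughly 1.03x on ProtocolQA and 0.95x on BioLP-bench -- about parity on both.
```

On that normalization the two evals tell the same story: o1-preview sits at roughly human-expert level on both.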
And I’m unsure that the experts are comparable, to be frank. Due to financial limitations, I used graduate students for BioLP-bench, while the authors of LAB-bench used PhD-level scientists.
I didn’t have o1 in mind; those exact results do seem consistent. Here’s an example I had in mind:
Claude 3.5 Sonnet (old) scores 48% on ProtocolQA and 7.1% on BioLP-bench
GPT-4o scores 53% on ProtocolQA and 17% on BioLP-bench
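A rough way to see that inconsistency is to look at the ProtocolQA-to-BioLP-bench ratio for each scorer: if BioLP-bench were simply “twice as hard”, the ratio should sit near 2x across the board. A sketch using only the numbers quoted in this thread:

```python
# ProtocolQA score divided by BioLP-bench score, per scorer,
# using only the figures quoted in this thread.
results = {
    "Human experts":           (0.79, 0.38),
    "o1-preview":              (0.81, 0.36),
    "Claude 3.5 Sonnet (old)": (0.48, 0.071),
    "GPT-4o":                  (0.53, 0.17),
}

for name, (protocolqa, biolp) in results.items():
    print(f"{name}: ProtocolQA / BioLP-bench = {protocolqa / biolp:.1f}x")

# Humans and o1-preview land near 2x, but Claude 3.5 Sonnet (old) is near 6.8x
# and GPT-4o near 3.1x -- the gap between the two evals isn't stable across models.
```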