I didn’t have o1 in mind; these exact results seem consistent. Here’s an example of what I had in mind:
Claude 3.5 Sonnet (old) scores 48% on ProtocolQA and 7.1% on BioLP-bench
GPT-4o scores 53% on ProtocolQA and 17% on BioLP-bench