DeepSeek R1 being #1 on Humanity’s Last Exam is not strong evidence that it’s the best model, because the questions were adversarially filtered against o1, Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o. If the questions hadn’t been filtered against those models, I’d bet o1 would outperform R1.
To ensure question difficulty, we automatically check the accuracy of frontier LLMs on each question prior to submission. Our testing process uses multi-modal LLMs for text-and-image questions (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, o1) and adds two non-multi-modal models (o1-mini, o1-preview) for text-only questions. We use different submission criteria by question type: exact-match questions must stump all models, while multiple-choice questions must stump all but one model to account for potential lucky guesses.
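To make the filtering criteria concrete, here is a minimal sketch of the difficulty check the quoted passage describes. The model lists follow the quote, but the `grade_answer` helper and the question dictionary format are hypothetical, introduced only for illustration; this is not the actual HLE submission pipeline.

```python
# Sketch of the difficulty check described above (not the actual HLE code).
MULTI_MODAL_MODELS = ["GPT-4o", "Gemini 1.5 Pro", "Claude 3.5 Sonnet", "o1"]
TEXT_ONLY_EXTRAS = ["o1-mini", "o1-preview"]

def passes_difficulty_check(question, grade_answer):
    """Return True if the question is hard enough to be submitted.

    `grade_answer(model, question)` is a hypothetical callable that runs the
    question through `model` and returns True if the model answered correctly.
    """
    models = list(MULTI_MODAL_MODELS)
    if not question.get("has_image"):
        # Text-only questions also face the two non-multi-modal o1 variants.
        models += TEXT_ONLY_EXTRAS

    num_correct = sum(grade_answer(m, question) for m in models)

    if question["type"] == "exact_match":
        return num_correct == 0   # must stump every model
    elif question["type"] == "multiple_choice":
        return num_correct <= 1   # allow one model to get lucky
    else:
        raise ValueError(f"unknown question type: {question['type']}")
```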
If I were writing the paper, I would have added either a footnote or an additional column to Table 1 noting that GPT-4o, o1, Gemini 1.5 Pro, and Claude 3.5 Sonnet were adversarially filtered against. Most people only look at Table 1, so it seems important to make that clear there.
Yes, this is point #1 from my recent Quick Take. Another interesting point is that there are no confidence intervals on the accuracy numbers—it looks like they only ran each question through each model once, so we don’t know how much random variation might account for the differences between accuracy numbers. [Note added 2-3-25: I’m not sure why it didn’t make the paper, but Scale AI does report confidence intervals on their website.]
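As a rough illustration of how much single-run noise could matter, here is a minimal sketch of a binomial 95% confidence interval around an observed accuracy, using the normal approximation. The accuracy and question count below are made-up illustrative values, not numbers from the paper or the leaderboard.

```python
import math

def accuracy_ci(accuracy, n_questions, z=1.96):
    """Approximate 95% confidence interval for an observed accuracy (normal approximation)."""
    se = math.sqrt(accuracy * (1 - accuracy) / n_questions)
    return accuracy - z * se, accuracy + z * se

# Hypothetical numbers, chosen only to show the scale of the error bars.
low, high = accuracy_ci(accuracy=0.09, n_questions=3000)
print(f"9.0% accuracy on 3,000 questions -> 95% CI of roughly {low:.1%} to {high:.1%}")
```

At that scale, a gap of a percentage point or so between two models’ single-run accuracies could plausibly fall within the noise, which is why error bars would be useful here.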
Terminology question—does adversarial filtering mean the same thing as decontamination?
In order to submit a question to the benchmark, people had to run it against the listed LLMs; the question would only advance to the next stage once the LLMs used for this testing got it wrong.