Error

LW server reports: not allowed.

This probably means the post has been deleted or moved back to the author's drafts.

jacobjacob 7 Jan 2024 17:49 UTC
2 points
0
Humans achieve over 95% accuracy, while no model surpasses 50% accuracy. (2019)

A series on benchmarks does seem very interesting and useful, but you really gotta report more recent model results than from 2019!! GPT-4 reportedly reaches 95.3% on HellaSwag, making that initial claim in the post very misleading.
- Bruce W. Lee 7 Jan 2024 19:46 UTC
  2 points
  0
  Thanks for the feedback. This is similar to the feedback that I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important not to mislead anyone.
  
  But yes, I did have several major epistemic concerns about the reliability of current academic practices for reporting performance scores. Even if a certain group of researchers were very ethical, how will we, as readers, ever confirm that the numbers are indeed correct, or even that an experiment was ever run?
  
  I was weighing the overall benefits of reporting such (in my opinion) non-provable numbers against just focusing on the situation in which the paper was written and enjoying the a-ha moments that the authors would have felt back then.
  
  Anyway, before I post another benchmark study blog tomorrow, I’ll devise some steps of action to satisfy both my concern and yours. It’s always a joy to post here on LessWrong. Thanks for the comment!
  - jacobjacob 7 Jan 2024 20:33 UTC
    2 points
    0
    If that’s your belief, I think you should edit in a disclaimer to your TL;DR section, like “Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don’t trust their methodology”.
    Also, the numbers aren’t “non-provable”: anyone could just replicate them with the GPT-4 API! (Modulo dataset contamination considerations.)
    - Bruce W. Lee 8 Jan 2024 2:59 UTC
      1 point
      0
      Thanks for the recommendation, though I’ll think of a more fundamental solution to satisfy all ethical/communal concerns.
      
      “Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don’t trust their methodology.” Regarding this, just to sort everything out, because I’m writing under my real name: I do trust the authors and ethics of both OpenAI and DeepMind. It’s just me questioning everything while I still can, as a student. But I’ll make sure not to cause any further confusion, as you recommended!