Humans achieve over 95% accuracy, while no model surpasses 50% accuracy. (2019)
A series on benchmarks does seem very interesting and useful, but you really gotta report more recent model results than from 2019!! GPT-4 reportedly reaches 95.3% on HellaSwag, making that initial claim in the post very misleading.
Thanks for the feedback. This is similar to the feedback that I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important not to mislead anyone.
But yes, I did have several major epistemic concerns about the reliability of current academic practices for reporting performance scores. Even if a given group of researchers is very ethical, how can I, as a reader, ever confirm that the numbers are indeed correct, or even that an experiment was ever run?
I was weighing the overall benefit of reporting such (in my opinion) non-provable numbers against simply focusing on the situation in which the paper was written and enjoying the a-ha moments that the authors would have felt back then.
Anyway, before I post another benchmark study blog tomorrow, I’ll work out some steps to address both my concerns and yours. It’s always a joy to post here on LessWrong. Thanks for the comment!
If that’s your belief, I think you should edit in a disclaimer to your TL;DR section, like “Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don’t trust their methodology”.
Also, the numbers aren’t “non-provable”: anyone could just replicate them with the GPT-4 API! (Modulo dataset contamination considerations.)
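To make the replication point concrete, here is a minimal sketch of the scoring half of such a check, assuming you already have a multiple-choice dataset in hand. The `predict` callable is a hypothetical stand-in for whatever model call you use (it is not a real API), and the two items are toy placeholders, not actual HellaSwag examples.

```python
from typing import Callable, Sequence, Tuple

# One multiple-choice item: (context, candidate endings, index of the gold ending).
Item = Tuple[str, Sequence[str], int]


def accuracy(items: Sequence[Item],
             predict: Callable[[str, Sequence[str]], int]) -> float:
    """Return the fraction of items where predict() picks the gold ending."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for ctx, endings, gold in items
                  if predict(ctx, endings) == gold)
    return correct / len(items)


# Toy placeholder items for illustration only (NOT real HellaSwag data).
toy_items = [
    ("A man pours water into a kettle.",
     ["He turns on the stove.", "He paints the ceiling."], 0),
    ("A girl ties her running shoes.",
     ["She dives into a pool.", "She starts jogging."], 1),
]


def always_first(ctx: str, endings: Sequence[str]) -> int:
    # Hypothetical baseline standing in for a real model call.
    return 0


print(accuracy(toy_items, always_first))  # prints 0.5
```

Swapping `always_first` for a function that queries a model would give a number you can compare against the reported one, subject to the contamination caveat above and to matching the original prompt format and number of shots.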
Thanks for the recommendation, though I’ll think of a more fundamental solution to satisfy all ethical/communal concerns.
“Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don’t trust their methodology.” Regarding this, and just to be clear since I’m writing under my real name: I do trust the authors and the ethics of both OpenAI and DeepMind. It’s just me questioning everything while I still can as a student. But I’ll make sure not to cause any further confusion, as you recommended!