The BIG-Bench paper that those ‘human’ numbers come from (unpublished, quasi-public as TeX here) cautions against taking those averages very seriously, and it does not give complete details about who the humans are or how they were asked/incentivized to behave on tasks that required specialized skills:
Thank you for this important caveat. As an imperfect Bayesian, I expect that if I analyzed the benchmark, I would update towards a belief that the results are real but less impressive than the article makes them appear.
:)