1-shot PaLM beats the average human baseline on BIG-bench Lite.
From a wider perspective, the issue with these benchmark scores is that the model doesn't produce the answer on the spot given that particular prompt, not that it lacks the reasoning capabilities. E.g. models, if prompted correctly, can follow programs and tell you what value each variable holds on each line. They can solve 4x4 sudokus. They struggle with anagrams, but BPE tokenization is to blame for that. They can do multi-step arithmetic.
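To make the program-tracing claim concrete, here's a minimal sketch of the sort of scratchpad prompt I have in mind. Nothing here assumes a particular model or API, and the exemplar programs are my own illustration; you would send `trace_prompt` to whatever completion endpoint you have access to.

```python
# A minimal sketch of a scratchpad-style prompt for program tracing.
# The worked example shows the model the format: variable values after
# every line, rather than a single final answer.

trace_prompt = """Trace each program line by line, giving the variable values after each line.

Program:
x = 3
y = x * 2
x = x + y

Trace:
After line 1: x = 3
After line 2: x = 3, y = 6
After line 3: x = 9, y = 6

Program:
a = 5
b = a - 2
a = a * b

Trace:
"""

# A model prompted this way will typically continue the pattern, e.g.:
# After line 1: a = 5
# After line 2: a = 5, b = 3
# After line 3: a = 15, b = 3
print(trace_prompt)
```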
I really don't see how this benchmark wouldn't mostly fall to proper prompt harnesses, and while I do understand that having specific human-tuned prompts for each task may seem inauthentic, it hardly follows that the problem is that the models can't reason.
To be clear, I don't mean that passing BIG-bench wouldn't imply reasoning capability; I mean the converse, that failing BIG-bench doesn't imply these models can't do those tasks. Some of these tasks are unreasonable to ask in a zero-shot, no-description, no-scratchpad setting.
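To make the contrast concrete, here's a rough sketch of the two framings, using the emoji task quoted below as the running example. The task description and the 🦁👑 exemplar are made up for illustration, not actual BIG-bench task material.

```python
# Illustrative only: the task description and the lion/crown exemplar are
# invented for this sketch, not taken from BIG-bench.

# The harsh setting: zero-shot, no task description, no scratchpad.
zero_shot = "Q: What movie does this emoji describe? 👧🐟🐠🐡\nA:"

# A harness-style framing of the same question: describe the task, show one
# worked exemplar, and leave room for the model to reason before it answers.
with_harness = """You will be shown a string of emoji describing a well-known movie.
Think about what each emoji could stand for, then name the movie.

Emoji: 🦁👑
Reasoning: a lion and a crown, so a story about a lion who becomes king.
Movie: The Lion King

Emoji: 👧🐟🐠🐡
Reasoning:"""
```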
The paper gives other details of how models learn across scales,
Q: What movie does this emoji describe? 👧🐟🐠🐡
2m: i’m a fan of the same name, but i’m not sure if it’s a good idea
16m: the movie is a movie about a man who is a man who is a man …
53m: the emoji movie 🐟🐠🐡
125m: it’s a movie about a girl who is a little girl
244m: the emoji movie
422m: the emoji movie
1b: the emoji movie
2b: the emoji movie
4b: the emoji for a baby with a fish in its mouth
8b: the emoji movie
27b: the emoji is a fish
128b: finding nemo
and
Below 17M parameters, models generally output nonsensical sentences, e.g., The number of the number of atomic number is the number of atomic..., or I’m not sure if it’s a bit of a problem. From 57M to 453M parameters, models seem to notice that numbers are in the question, and answer with (still-nonsensical) strings like, The name of the element with an atomic number of 65 or The atomic number of 98 is 98. The 1B model is the smallest to venture a guess with an element name, occasionally saying something like, The element with an atomic number of 1 is called a hydrogen atom. However, hydrogen is the only element it identifies correctly. The next-largest model, 2B, guesses aluminum for almost every question (with a notable exception when it is asked for element 13, for which aluminum would have been correct; in that case it said a noble gas). Starting with the 4B model, all the larger models output legitimate element names in their responses, though as Figure 17 shows, only for the largest model, 128B, are a significant fraction of these correct.
which suggests that the weak trend in the early part of the graph still reflects meaningful early-stage learning about how to answer the question; it just doesn't necessarily carry all the way through to correct answers.
So, while BIG-bench seems like a reasonable benchmark to have from a capability-development perspective, since the difficulties it raises are both practically interesting and very tractable, I think people should be careful about reading scaling trends into these numbers, and should brace themselves for discontinuous future progress once a bunch of the prompting issues get worked out. Even on tasks where all models score at or around random chance, the skills needed often verifiably exist.