1-shot PaLM beats the average human baseline on BIG-bench Lite.
From a wider perspective, the issue with these benchmark scores is that the model doesn't produce the answer on the spot given that particular prompt, not that it lacks the reasoning capabilities. E.g. models, if prompted correctly, can follow programs and tell you what value each variable holds on each line. They can solve 4x4 sudokus. They struggle with anagrams, but BPE tokenization is to blame for that. They can do multi-step arithmetic.
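To make the program-tracing claim concrete, here's a minimal sketch of the sort of scratchpad prompt I have in mind. Nothing here assumes a particular model or API, and the exemplar programs are my own illustration; you would send `trace_prompt` to whatever completion endpoint you have access to.

```python
# A minimal sketch of a scratchpad-style prompt for program tracing.
# The worked example shows the model the format: variable values after
# every line, rather than a single final answer.

trace_prompt = """Trace each program line by line, giving the variable values after each line.

Program:
x = 3
y = x * 2
x = x + y

Trace:
After line 1: x = 3
After line 2: x = 3, y = 6
After line 3: x = 9, y = 6

Program:
a = 5
b = a - 2
a = a * b

Trace:
"""

# A model prompted this way will typically continue the pattern, e.g.:
# After line 1: a = 5
# After line 2: a = 5, b = 3
# After line 3: a = 15, b = 3
print(trace_prompt)
```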
I really don't see how this benchmark wouldn't mostly fall to proper prompt harnesses, and while I do understand that having specific human-tuned prompts for each task may seem inauthentic, it hardly follows that the problem is that the models can't reason.
To be clear, I don't mean that passing BIG-bench wouldn't imply reasoning capability; I mean the converse, that failing BIG-bench doesn't imply these models can't do those tasks. Some of these tasks are unreasonable to ask in a zero-shot, no-description, no-scratchpad setting.
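To make the contrast concrete, here's a rough sketch of the two framings, using the emoji task quoted below as the running example. The task description and the 🦁👑 exemplar are made up for illustration, not actual BIG-bench task material.

```python
# Illustrative only: the task description and the lion/crown exemplar are
# invented for this sketch, not taken from BIG-bench.

# The harsh setting: zero-shot, no task description, no scratchpad.
zero_shot = "Q: What movie does this emoji describe? 👧🐟🐠🐡\nA:"

# A harness-style framing of the same question: describe the task, show one
# worked exemplar, and leave room for the model to reason before it answers.
with_harness = """You will be shown a string of emoji describing a well-known movie.
Think about what each emoji could stand for, then name the movie.

Emoji: 🦁👑
Reasoning: a lion and a crown, so a story about a lion who becomes king.
Movie: The Lion King

Emoji: 👧🐟🐠🐡
Reasoning:"""
```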
The paper gives other details of how models learn across scales,
Q: What movie does this emoji describe? 👧🐟🐠🐡
2m: i’m a fan of the same name, but i’m not sure if it’s a good idea
16m: the movie is a movie about a man who is a man who is a man …
53m: the emoji movie 🐟🐠🐡
125m: it’s a movie about a girl who is a little girl
244m: the emoji movie
422m: the emoji movie
1b: the emoji movie
2b: the emoji movie
4b: the emoji for a baby with a fish in its mouth
8b: the emoji movie
27b: the emoji is a fish
128b: finding nemo
and
Below 17M parameters, models generally output nonsensical sentences, e.g., The number of the number of atomic number is the number of atomic..., or I’m not sure if it’s a bit of a problem. From 57M to 453M parameters, models seem to notice that numbers are in the question, and answer with (still-nonsensical) strings like, The name of the element with an atomic number of 65 or The atomic number of 98 is 98. The 1B model is the smallest to venture a guess with an element name, occasionally saying something like, The element with an atomic number of 1 is called a hydrogen atom. However, hydrogen is the only element it identifies correctly. The next-largest model, 2B, guesses aluminum for almost every question (with a notable exception when it is asked for element 13, for which aluminum would have been correct; in that case it said a noble gas). Starting with the 4B model, all the larger models output legitimate element names in their responses, though as Figure 17 shows, only for the largest model, 128B, are a significant fraction of these correct.
which suggests that the weak trend in the early part of the graph still reflects meaningful early-stage learning about how to answer the question; it just doesn't necessarily carry all the way through to correct answers.
So, while BIG-bench seems like a reasonable benchmark to have from a capability-development perspective, since the difficulties it raises are both practically interesting and very tractable, I think people should be careful about reading scaling trends into these numbers, and should brace themselves for discontinuous future progress once a bunch of the prompting issues get worked out. Even on tasks where all models score at or around random chance, the skills needed often verifiably exist.