Yes, you could definitely have misleading perplexities: for example, an improvement on a subset which is rare but vital may not overcome the noise in the evaluation (you are stacking multiple layers of measurement error/variance when you evaluate a single checkpoint on a single small heldout set of datapoints). After all, this is the entire problem to begin with: overall perplexity has a very unclear relationship to various kinds of performance. So your overall BIG-Bench perplexity would tell you little about whether there are any jaggies when you break it down into individual BIG-Bench components, and there is no reason to think the individual components are ‘atomic’, so the measurement regress continues… The fact that someone like Paul can come along afterwards and tell you “ah, but the perplexity would have been smooth if only you had chosen the right subset of datapoints to measure progress on as your true benchmark” would not matter.
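To put rough numbers on the noise point, here is a minimal back-of-the-envelope sketch. It uses accuracy rather than perplexity because the binomial standard error is simple to write down; that substitution, the function names, and the crude 2-standard-error threshold are all my own illustrative assumptions, and it ignores checkpoint-to-checkpoint variance entirely (which would only make things worse).

```python
import math


def accuracy_standard_error(p: float, n_items: int) -> float:
    """Binomial standard error of an accuracy estimate from n_items independent items."""
    return math.sqrt(p * (1.0 - p) / n_items)


def aggregate_shift(subset_fraction: float, subset_gain: float) -> float:
    """Change in overall accuracy when only a fraction of the items improves."""
    return subset_fraction * subset_gain


def is_detectable(subset_fraction: float, subset_gain: float,
                  baseline_acc: float, n_items: int, z: float = 2.0) -> bool:
    """Crude check: does the shift exceed z standard errors of a single evaluation?

    This is an illustrative threshold, not a proper hypothesis test.
    """
    shift = aggregate_shift(subset_fraction, subset_gain)
    noise = z * accuracy_standard_error(baseline_acc, n_items)
    return shift > noise


# Hypothetical numbers: a 1,000-item heldout set where a rare 2% subset
# jumps from 50% to 100% accuracy while everything else stays put.
hidden_in_aggregate = is_detectable(0.02, 0.5, baseline_acc=0.6, n_items=1000)

# The same jump, measured only on the 20 items that make up that subset.
visible_on_subset = is_detectable(1.0, 0.5, baseline_acc=0.5, n_items=20)
```

With these made-up numbers, the jump is lost in the aggregate but obvious on the subset, provided you already knew which 20 items to look at, which is exactly the part you only learn afterwards.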
BIG-Bench would appear to provide another instance of this in the latest PaLM inner-monologue paper, “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”, Suzgun et al 2022: they select a subset of the hardest feasible-looking BIG-Bench tasks, and benchmark PaLM on them. No additional training, just better prompting on a benchmark designed to be as hard as possible. Inner-monologue prompts, unsurprisingly by this point, yield considerable improvement… and they also change the scaling for several of the benchmarks: what looks like a flat scaling curve with the standard obvious 5-shot benchmark prompt can turn out to be a much steeper curve as soon as they use the specific chain-of-thought prompt. (For example, “Web of Lies” goes from a consistent random 50% at all model sizes to scaling smoothly from ~45% to ~100% performance.) And I don’t know any reason to think that CoT is the best possible inner-monologue prompt for PaLM, either.
“Sampling can show the presence of knowledge but not the absence.”