I don’t think we can even conclude for certain that a lack of measured log-likelihood improvement implies that it won’t, though it is evidence. Maybe the data used to measure the behavior doesn’t successfully prompt the model to do the behavior, or maybe it’s phrased in a way the model recognizes as unlikely, so that at some scale the model stops increasing likelihood on that sample, etc.; as you would say, prompting can show presence but not absence.
Yes, you could definitely have misleading perplexities, like an improvement on a subset which is rare but vital and which does not overcome noise in the evaluation (you are stacking multiple layers of measurement error/variance when you evaluate a single checkpoint on a single small heldout set of datapoints). After all, this is in fact the entire problem to begin with: our overall perplexity has very unclear relationships to various kinds of performance. So your overall BIG-Bench perplexity would tell you little about whether there are any jaggies when you break it down to individual BIG-Bench components, and there is no reason to think the individual components are ‘atomic’, so the measurement regress continues… The fact that someone like Paul can come along afterwards and tell you “ah, but the perplexity would have been smooth if only you had chosen the right subset of datapoints to measure progress on as your true benchmark” would not matter.
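To make the "rare but vital subset" point concrete, here is a toy calculation. Every number in it is made up for illustration (the token counts, the per-token losses, and the two-group split are all assumptions, not measurements from any model). It only shows the arithmetic of how a large gain on 1% of the tokens barely moves the aggregate.

```python
import math


def mean_nll(groups):
    """Token-weighted mean negative log-likelihood (nats) over (count, nll) groups."""
    total = sum(n for n, _ in groups)
    return sum(n * nll for n, nll in groups) / total


# Hypothetical eval set: (number of tokens, mean NLL per token in nats).
# First group: ordinary text. Second group: the rare-but-vital subset.
small_model = [(9900, 3.00), (100, 6.00)]
large_model = [(9900, 3.00), (100, 1.00)]

agg_small = mean_nll(small_model)
agg_large = mean_nll(large_model)

print(f"aggregate NLL: {agg_small:.2f} -> {agg_large:.2f} nats")
print(f"aggregate perplexity: {math.exp(agg_small):.1f} -> {math.exp(agg_large):.1f}")
print(f"subset perplexity: {math.exp(6.00):.1f} -> {math.exp(1.00):.1f}")
```

In this made-up case the subset loss falls by 5 nats while the aggregate falls by only 0.05 nats, a shift that checkpoint-to-checkpoint and heldout-set noise could plausibly swamp.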
BIG-Bench would appear to provide another instance of this in the latest PaLM inner-monologue paper, “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”, Suzgun et al 2022: they select a subset of the hardest feasible-looking BIG-Bench tasks, and benchmark PaLM on them. No additional training, just better prompting on a benchmark designed to be as hard as possible. Inner-monologue prompts, unsurprisingly by this point, yield considerable improvement… and they also change the scaling for several of the benchmarks: what looks like a flat scaling curve with the standard obvious 5-shot benchmark prompt can turn out to be a much steeper curve as soon as they use the specific chain-of-thought prompt. (For example, “Web of Lies” goes from a consistent random 50% at all model sizes to scaling smoothly from ~45% to ~100% performance.) And I don’t know any reason to think that CoT is the best possible inner-monologue prompt for PaLM, either.
“Sampling can show the presence of knowledge but not the absence.”
Agreed. I’d also add:
I think we can mitigate the phrasing issues by presenting tasks in a multiple-choice format and measuring log-probability on the scary answer choice.
I think we’ll also want to write hundreds of tests for a particular scary behavior (e.g., power-seeking), rather than a single test. This way, we’ll get somewhat stronger (but still non-conclusive) evidence that the particular scary behavior is unlikely to occur in the future, if all of the tests show decreasing log-likelihood on the scary behavior.
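A rough sketch of what that bookkeeping might look like, under stated assumptions: the function names (`scary_logprob`, `slope`, `trend_summary`), the test names, and all the log-probabilities below are hypothetical, and renormalizing over the listed answer choices is one possible convention rather than an established protocol. In practice the per-choice log-probs would come from whatever model API is being evaluated.

```python
import math


def scary_logprob(choice_logprobs, scary_index):
    """Log-prob of the scary choice, renormalized over the listed choices."""
    m = max(choice_logprobs)
    log_z = m + math.log(sum(math.exp(lp - m) for lp in choice_logprobs))
    return choice_logprobs[scary_index] - log_z


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def trend_summary(log_params, raw, scary_index):
    """Per-test slope of scary log-prob vs. log model size, plus non-decreasing tests."""
    slopes = {}
    for name, per_size in raw.items():
        ys = [scary_logprob(lps, scary_index) for lps in per_size]
        slopes[name] = slope(log_params, ys)
    rising = sorted(name for name, s in slopes.items() if s >= 0)
    return slopes, rising


# Hypothetical data: log10(parameter count) for four model sizes, and for each
# test the raw log-probs of [safe choice, scary choice] at each size.
log_params = [8, 9, 10, 11]
raw = {
    "test_a": [[-1.0, -1.0], [-0.8, -1.4], [-0.6, -2.0], [-0.5, -2.6]],
    "test_b": [[-1.2, -1.0], [-1.1, -1.0], [-1.1, -0.9], [-1.2, -0.7]],
}

slopes, rising = trend_summary(log_params, raw, scary_index=1)
print("tests where the scary choice is not getting less likely:", rising)
```

With hundreds of tests, `rising` is the list to look at first. An empty list would still only be the "somewhat stronger (but still non-conclusive)" evidence described above, since a test that fails to elicit the behavior will show a reassuring slope for the wrong reason.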