Strong upvote because I want to signal boost this paper, though I think “It provides some evidence against the idea that ‘understanding is discontinuous’” is too strong and this is actually very weak evidence.
Main ideas:
Emergent abilities, defined as being sharp and unpredictable, sometimes go away when we adopt different measurement techniques, or at least they become meaningfully less sharp and unpredictable.
Changing from non-linear/discontinuous metrics (e.g., Accuracy, Multiple Choice Grade) to linear/continuous metrics (e.g., Token Edit Distance, Brier Score) can cause many emergent abilities to disappear; Figure 3 and much of the paper. (See the toy sketch after this list.)
The authors find support for this hypothesis via:
Measuring GPT math performance under several different metrics, finding that performance looks much less sharp/unpredictable under the continuous ones.
Meta-analysis: examining alleged emergent abilities in BIG-Bench, finding that there is not very much emergence and that 92% of claimed emergent abilities occur under Multiple Choice Grade or Exact String Match, metrics we would expect to behave discontinuously; Figure 5. Additionally, for the BIG-Bench tasks on which LaMDA displays emergence, switching from Multiple Choice Grade to Brier Score causes the emergence to disappear.
Inducing emergence: Taking models and tasks which do not typically exhibit emergence and modifying the metric to elicit emergence. Figures 7, 8.
Sometimes emergent abilities go away when you use a larger test set (the small models were bad enough that their performance was rounding to zero on small test sets); Figure 4 compared to Figure 3 top. This may work even if you are still using a non-linear metric like Accuracy.
Observed emergent abilities may be in part due to sparsely sampling from models with lots of parameters (because it’s costly to train multiple); Figure 7.
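To make the metric point concrete, here is a minimal toy sketch (my own illustration, not code from the paper; the per-token scaling curve, sequence length, and test-set size are invented numbers): if per-token accuracy improves smoothly with scale, an all-or-nothing metric like Exact String Match over a multi-token answer stays near zero and then shoots up at the largest scales, while a per-token, edit-distance-style metric improves smoothly the whole way.

```python
import numpy as np

rng = np.random.default_rng(0)

log_params = np.linspace(7, 11, 20)                  # pretend model sizes: 10^7 .. 10^11 params
per_token_acc = 1 / (1 + np.exp(-(log_params - 9)))  # smooth, made-up per-token scaling curve

seq_len = 10   # length of the target answer in tokens
n_test = 200   # test-set size

for log_n, p in zip(log_params, per_token_acc):
    # Simulate which tokens each model gets right on each test item.
    correct_tokens = rng.random((n_test, seq_len)) < p

    # Nonlinear / all-or-nothing metric: exact string match (every token correct).
    exact_match = correct_tokens.all(axis=1).mean()

    # Linear / continuous metric: fraction of wrong tokens (a crude stand-in
    # for normalized token edit distance).
    token_error = 1.0 - correct_tokens.mean()

    print(f"~1e{log_n:.1f} params | per-token acc {p:.2f} | "
          f"exact match {exact_match:.2f} | token error {token_error:.2f}")
```

The same toy setup also illustrates the test-set-size point: with only a handful of test items, the small models’ nonzero-but-tiny exact-match rates round to zero, which makes the eventual jump look even sharper than it is.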
What I’m taking away besides the above:
I think this paper should give hope to those trying to detect deception and other dangerous model capabilities. While the downstream tasks we care about might be quite discontinuous in nature (we might be fine with an AI that can design up to 90% of a pathogen, but very dead at 100%), there is hope in identifying continuous, measurable metrics that are correlated with the discontinuous capability we care about. It’s likely pretty hard to design such metrics, but we would be shooting ourselves in the foot to just go “oh, deception will be emergent, so there’s no way to predict it ahead of time.” This paper gives a couple of ideas for preventing that problem: designing more continuous and linear metrics, creating larger test sets, and sampling more large models.
Despite the provocative title, the paper doesn’t say “emergence isn’t a thing, nothing to worry about here.” Rather, it gestures toward approaches we can take to make the unpredictable thing more predictable and indicates that the current unpredictability is largely resolved by different metrics, which is exactly what we should be trying to do when we want to avoid dangerous capabilities.