Yes, but I think that’s exactly what I haven’t seen. When I’ve seen benchmarks that try to do this, I’ve seen either:
That specific benchmark is not actually very smooth OR
The relationship of that benchmark to the task at hand came apart at an unexpected time
Can you give some examples?
I don’t think people have created good benchmarks for things like “ability to hack into computers”, but I suspect this is partly because relatively little effort has gone into making good benchmarks. Even for relatively basic things like mathematical problem solving, we have very few high-quality benchmarks, and this doesn’t seem explained by people trying hard but failing. I suspect we just don’t have much effort going into creating good benchmarks.
But we do have lots of benchmarks for non-useful things, and the paper is just saying that these benchmarks show smooth performance.
Insofar as you’re saying that progress on existing benchmarks doesn’t actually look smooth, it sounds like you’re not responding to the contribution of the paper, which was that you can perform a simple modification to the performance metric to make performance look smooth as a function of scale (e.g. rather than looking at accuracy you can look at edit distance). Perhaps you disagree, but I think the results in this paper straightforwardly undermine the idea that progress has been non-smooth as measured by benchmarks.
I’d particularly like to see a specific example of “relationship of that benchmark to the task at hand came apart at unexpected time”.
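To make the accuracy-vs-edit-distance contrast concrete, here is a minimal sketch with hypothetical model outputs (not data from the paper): exact match jumps from 0 to 1 all at once, while partial credit via normalized token edit distance improves gradually as the answers get closer to correct.

```python
# Minimal sketch: exact-match vs. edit-distance scoring on one arithmetic answer.
# The "model outputs" below are hypothetical, chosen only to illustrate how
# partial credit changes the shape of the performance curve.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over token sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1])) # substitution
        prev = cur
    return prev[n]

target = list("123456")

# Hypothetical outputs from models of increasing scale: each gets more
# digits right, but only the last one matches exactly.
outputs = ["999999", "129999", "123499", "123459", "123456"]

exact = [1.0 if o == "".join(target) else 0.0 for o in outputs]
# Partial credit: 1 minus normalized edit distance.
partial = [1 - levenshtein(list(o), target) / len(target) for o in outputs]

print(exact)    # all-or-nothing: flat at 0, then jumps to 1
print(partial)  # improves steadily with each output
```

Under the exact-match metric this looks like a sudden emergent jump; under the partial-credit metric the same outputs trace a smooth curve.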
Sorry for not responding to this. Examples do seem great, though digging up the exact charts I remember has turned out to be a bigger time investment than I expected.
Some quick things I remembered feeling not that informative:
Go performance measured in Elo felt pretty hard to forecast from this kind of graph
Things like “When does chain-of-thought reasoning work?” for LLMs
LLM performance on various arithmetic tasks
Things like AlphaFold, where I feel like there was basically no precursor. I remember there being forecasts about DL and protein folding, and I feel like none of them were very informative about when it would actually fall.
Sorry again for not linking to things. I might get around to writing a post on this, since I do think it really deserves more exploration, but time is short these days.
I think asking for non-smoothness before calling something an emergent property is unreasonable. If a performance graph is precisely an S-curve along a reasonable metric, it is reasonable to call that emergent, even though it is perfectly smooth and even though you can reparametrize it to make it seem linear.
I haven’t looked at the paper to see what its substance is, but from the description alone it could be a mathematical sleight of hand.
Couldn’t the opposite critique easily be made? If some metric looks linear, then you could easily reparameterize it to make it look non-linear, and then call it emergent. That makes any claim about emergence trivial, if all you mean by emergence is that it arises non-linearly.
The central claim about emergent abilities, as I understood it, was that such abilities cannot be predicted ahead of time. But the fact that you can reparameterize any metric to make it linear, and then predict when it will reach some threshold seems like an extremely important fact, if true.
Compare two possible claims about some emergent ability:
“At the 10^28 training FLOP level, LLMs will suddenly get the ability to hack into computers competently.”
“At some training FLOP level—which cannot be predicted ahead of time—LLMs will suddenly get the ability to hack into computers competently.”
Both claims are worrisome, since both imply that at some point we will go from having LLMs that can’t hack into other computers, to LLMs that can. But I would be way more worried if the second claim is true, compared to the first.
The central claim about emergent abilities, as I understood it, was that such abilities cannot be predicted ahead of time. But the fact that you can reparameterize any metric to make it linear, and then predict when it will reach some threshold seems like an extremely important fact, if true.
Of course you can pick a reparameterization in hindsight, but without the benefit of hindsight, which reparameterization, exactly...?
What is interesting about emergence is that it happens on ‘natural’ parameterizations of metrics, the ones people come up with in advance of knowing the results from scaling, as opposed to retrodicting/curve-fitting ad hoc measures to make an emergence go away. No one designed any of these Big-Bench or other tasks to display emergence, and most of the initial dozen or so examples weren’t even particularly highlighted by the original authors back when I was collecting them to try to convince people that this was an actual thing which actually happened and was worth trying to understand (particularly connections to inner-monologue, hidden scaling, and U-shaped scaling).
When emergence happens on an obvious natural metric like accuracy, chosen independently of any scaling considerations at all, which often maps onto real world rewards and loss functions, then I am surprised. When un-emergence is retrodicted by the choice of metrics like… [checks notes]… ‘arithmetic accuracy expressed as a function of edit distance on BPE tokens’ (and a different one for each un-emergence) in order to explain away previously observed emergence and this retrodiction is being advertised to all and sundry as evidence of ‘predicting emergence’, then I am surprised in an entirely different way.
What is interesting about emergence is that it happens on ‘natural’ parameterizations of metrics, the ones people come up with in advance of knowing the results from scaling, as opposed to retrodicting/curve-fitting ad hoc measures to make an emergence go away.
It’s not clear to me that edit distance or Brier score are much less natural metrics than accuracy or multiple-choice grade. I agree that we should have a presumption here, since accuracy and multiple-choice grade were chosen first, but the presumption seems pretty weak to me.
I find it easy to imagine wanting to give a model partial credit for giving answers that are close to correct even before knowing anything about emergence. One plausible theory is that awarding partial credit might not have been salient to researchers because it’s not normally how we evaluate human students. But, our choice for how we evaluate human students seems more a function of evaluation costs and lack of access to output probabilities than anything deep about measuring performance.
For these reasons, I don’t really find the metrics used in the papers ad hoc, except to the extent that “award partial credit for answers that are close to correct” is ad hoc. One prediction I’d probably make is that if we continue to use the same measures (token edit distance and Brier score) then we’ll continue to see non-discontinuous progress on most benchmarks, by these measures. If true, that would at least partially falsify the claim that we were merely doing post-hoc curve fitting.
ETA: the paper says that in >92% of cases, emergence is only observed on two metrics: (1) “Multiple Choice Grade”, and (2) “Exact String Match”. I agree that Multiple Choice Grade is a fairly “natural” metric, but “Exact String Match” is less natural, and it doesn’t seem very interesting to me that we see emergence under that choice.
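To illustrate why Brier score counts as “partial credit” here, the following sketch (with hypothetical probabilities for models of increasing scale) compares it to multiple-choice grade on a single 4-way question whose correct option is index 0:

```python
# Sketch: multiple-choice grade vs. Brier score on one 4-way question.
# The probability vectors are hypothetical, standing in for models of
# increasing scale that put growing mass on the correct option (index 0).

def brier(probs, correct):
    """Mean squared error between predicted probs and the one-hot truth."""
    return sum((p - (i == correct)) ** 2 for i, p in enumerate(probs)) / len(probs)

def mc_grade(probs, correct):
    """All-or-nothing: 1 if the argmax option is the right one."""
    return 1.0 if max(range(len(probs)), key=probs.__getitem__) == correct else 0.0

runs = [
    [0.10, 0.40, 0.25, 0.25],
    [0.25, 0.35, 0.20, 0.20],
    [0.45, 0.25, 0.15, 0.15],
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]

grades = [mc_grade(p, 0) for p in runs]
scores = [brier(p, 0) for p in runs]
print(grades)  # jumps from 0 to 1 once option 0 becomes the argmax
print(scores)  # decreases smoothly across all five runs
```

The underlying capability (probability mass on the right answer) improves at every step; only the thresholded grade makes it look discontinuous.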
1. You can reparametrize any monotonic function to make it linear.
2. This can be used to predict the function.
These are wildly different claims. The point is that it’s always easy to do 1. in retrospect, and this has no bearing whatsoever on 2.
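A toy demonstration of this asymmetry (all functions here are illustrative choices, not anything from the paper): composing a sigmoid with its own inverse linearizes it perfectly, but only because we already know the curve is a sigmoid; a different monotone transform, chosen without that knowledge, leaves it curved.

```python
import math

# Sketch of the hindsight-linearization point: any monotonic curve becomes
# exactly linear after composing with its own inverse -- but writing down
# that inverse requires already knowing the curve.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

xs = [x / 2 for x in range(-10, 11)]   # the "scale" axis
ys = [sigmoid(x) for x in xs]          # a smooth but S-shaped metric

# In hindsight, the true inverse linearizes the data perfectly...
lin = [logit(y) for y in ys]
lin_slopes = [(lin[i + 1] - lin[i]) / 0.5 for i in range(len(xs) - 1)]

# ...while a different monotone transform, chosen without that knowledge,
# does not linearize it at all.
logged = [math.log(y) for y in ys]
log_slopes = [(logged[i + 1] - logged[i]) / 0.5 for i in range(len(xs) - 1)]

print(max(lin_slopes) - min(lin_slopes))  # ~0: logit makes it exactly linear
print(max(log_slopes) - min(log_slopes))  # large: log leaves it curved
```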
I think we would agree that (log-)FLOPs or parameters or some mild combination of those would count as a reasonable metric?
I’m not a statistician, but from what I know it should be extremely hard to predict S-curves before their inflection point, in particular if there’s no guarantee that what you’re predicting is literally a logistic function.
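A concrete way to see the difficulty (with illustrative parameters of my choosing): two logistic curves can agree to within about 1% everywhere before their inflection points while having ceilings a factor of 100 apart, so pre-inflection data cannot distinguish them.

```python
import math

# Sketch: two logistic curves that are nearly indistinguishable early on
# but have wildly different ceilings and inflection points. All parameters
# here are illustrative, picked so the early exponential tails coincide.

def logistic(x, L, k, x0):
    return L / (1 + math.exp(-k * (x - x0)))

# Both behave like ~exp(x - 10) well before their inflection points.
f_small = lambda x: logistic(x, L=1.0,   k=1.0, x0=10.0)
f_big   = lambda x: logistic(x, L=100.0, k=1.0, x0=10.0 + math.log(100.0))

early = [0, 1, 2, 3, 4, 5]   # well before either inflection point
for x in early:
    a, b = f_small(x), f_big(x)
    print(x, a, b, abs(a - b) / a)  # relative gap stays under ~1%

# Yet the curves end up a factor of ~100 apart:
print(f_small(30), f_big(30))
```

Any fit to the `early` data is consistent with both curves, so the inflection point and asymptote are essentially unconstrained.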
That being said, trying to create benchmarks for all kinds of tasks seems like a reasonable thing to do in any case.