What is interesting about emergence is that it happens on ‘natural’ parameterizations of metrics, the ones people come up with in advance of knowing the results from scaling, as opposed to retrodicting/curve-fitting ad hoc measures to make an emergence go away.
It’s not clear to me that edit distance or Brier score is a much less natural metric than accuracy or Multiple Choice Grade. I agree that there should be some presumption in favor of accuracy and Multiple Choice Grade, since they were chosen first, but that presumption seems pretty weak to me.
I find it easy to imagine wanting to give a model partial credit for answers that are close to correct, even before knowing anything about emergence. One plausible theory is that awarding partial credit wasn’t salient to researchers because it’s not normally how we evaluate human students. But our choice of how we evaluate human students seems more a function of evaluation costs and lack of access to output probabilities than of anything deep about measuring performance.
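To make the partial-credit point concrete, here is a small sketch (my own illustration, not code from either paper; the papers use token-level edit distance, while this uses characters for brevity): exact string match gives all-or-nothing credit, while one-minus-normalized-edit-distance awards partial credit for near-miss answers.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def exact_match(pred: str, target: str) -> float:
    """All-or-nothing: 1.0 only if the strings are identical."""
    return 1.0 if pred == target else 0.0

def partial_credit(pred: str, target: str) -> float:
    """1 minus normalized edit distance: near-miss answers score near 1."""
    denom = max(len(pred), len(target), 1)
    return 1.0 - edit_distance(pred, target) / denom

print(exact_match("2023", "2024"))     # 0.0 -- no credit at all
print(partial_credit("2023", "2024"))  # 0.75 -- three of four characters right
```

Under the exact-match metric, a model that goes from wildly wrong answers to near-misses shows no measured improvement at all, which is exactly the kind of thresholding that can manufacture an apparent discontinuity.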
For these reasons, I don’t really find the metrics used in the papers ad hoc, except to the extent that “award partial credit for answers that are close to correct” is ad hoc. One prediction I’d make is that if we continue to use the same measures (token edit distance and Brier score), then we’ll continue to see smooth, non-discontinuous progress on most benchmarks by these measures. If true, that would at least partially falsify the claim that we were merely doing post-hoc curve fitting.
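The Brier score half of that prediction rests on the same mechanism: accuracy only moves when the model’s top-ranked option flips, while the Brier score improves continuously as probability mass shifts toward the correct answer. A minimal sketch (my own illustration; the probability values are hypothetical):

```python
def accuracy(probs, correct_idx):
    """1.0 if the highest-probability option is the correct one, else 0.0."""
    return 1.0 if probs.index(max(probs)) == correct_idx else 0.0

def brier(probs, correct_idx):
    """Mean squared error between the predicted distribution and the
    one-hot target; lower is better."""
    target = [1.0 if i == correct_idx else 0.0 for i in range(len(probs))]
    return sum((p - t) ** 2 for p, t in zip(probs, target)) / len(probs)

# A hypothetical 4-way multiple-choice question; the correct answer is index 0.
weak   = [0.10, 0.40, 0.25, 0.25]  # model favors a wrong option
better = [0.35, 0.40, 0.15, 0.10]  # much more mass on the answer, argmax unchanged

print(accuracy(weak, 0), accuracy(better, 0))   # 0.0 0.0 -- no visible progress
print(brier(better, 0) < brier(weak, 0))        # True -- real, continuous progress
```

A model that improves like this looks completely flat on accuracy right up until the argmax flips, at which point accuracy jumps, while the Brier score was registering the underlying improvement the whole time.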
ETA: the paper says that in >92% of cases, emergence is only observed on two metrics: (1) “Multiple Choice Grade”, and (2) “Exact String Match”. I agree that Multiple Choice Grade is a fairly “natural” metric, but “Exact String Match” is less natural, and it doesn’t seem very interesting to me that we see emergence under that choice.