Oh yeah, one more thing, which I think might actually be the most important point. On a lot of these benchmarks — at the very least, on SuperGLUE — “human-level performance” is a much weaker requirement than “human equivalence.” Human performance isn’t necessarily an indicator of irreducible entropy in the underlying task. To a large extent, it just reflects ambiguity or coarseness in the dataset specification. A big part of this is the artificial setting of the data annotation, which is unfortunately kind of necessary in a lot of cases when the goal is characterizing abstract language understanding. On the long tail of examples that require more careful reasoning, in the absence of an underlying business reason or extra context to guide their judgments, annotators end up interpreting or construing the inputs differently, or applying the annotation guidelines differently, and disagreeing with each other on the output.

Human equivalence would mean sensitivity to all of the issues that lead a human to decide on one interpretation over another, but in the IID performance evaluation setting, these issues all basically wash out, looking like noise. So as long as the model can guess one of the plausible labels in these cases, it will be hard to distinguish from a human. NLI/RTE is a great example of this: it is a notoriously difficult problem to specify, and recent work has shown a great deal of disagreement between annotators in this task setting (including bimodal distributions indicative of genuine ambiguities in construal), to the point that supervised models seem to have already more or less hit the noise ceiling, at least on the SNLI/MultiNLI datasets.
So for the most part, when looking at these kinds of datasets — particularly data annotated in an artificial setting intended to capture something abstract and general about language — maximum performance on IID test sets is best seen not as the point of irreducible entropy (i.e., the true maximum performance on the task), but as a “noise ceiling” beyond which the particular dataset is incapable of distinguishing between models. See Schlangen for more on the relationship between data and task.
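To make the “noise ceiling” notion concrete: one common way to estimate it is to score each annotator against the majority vote of the other annotators and average. Here is a minimal sketch in Python, with made-up annotations (datasets like SNLI/MultiNLI really do attach five labels to each validation item); past this number, an IID test set can’t tell a better model from noise.

```python
from collections import Counter

# Made-up example: each item labeled by five annotators.
annotations = [
    ["entailment", "entailment", "neutral", "entailment", "entailment"],
    ["neutral", "contradiction", "neutral", "neutral", "contradiction"],
    ["contradiction"] * 5,
]

def majority(labels):
    # Ties are broken arbitrarily, which is roughly what picking a
    # single gold label does to a genuinely ambiguous item anyway.
    return Counter(labels).most_common(1)[0][0]

def noise_ceiling(annotations):
    """Average leave-one-annotator-out agreement: each annotator's
    label is scored against the majority vote of the rest, i.e., the
    accuracy a typical human gets when the majority label is treated
    as ground truth."""
    correct = total = 0
    for labels in annotations:
        for i, label in enumerate(labels):
            rest = labels[:i] + labels[i + 1:]
            correct += label == majority(rest)
            total += 1
    return correct / total

print(f"estimated noise ceiling: {noise_ceiling(annotations):.2f}")
```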
It’s an open question how to benchmark something resembling human equivalence more carefully and accurately, but pretty good concrete examples of how one might try include Contrast Sets and adversarial evaluation. Contrast sets look at model behavior under perturbation of inputs, testing the sensitivity of the decision function to certain changes; adversarial evaluation explicitly searches for evaluation items on which humans agree and the model is incorrect. In theory, these sorts of evaluations do a much better job of assessing model robustness under distribution shift or in the course of interaction with real users; they amount to a much tighter constraint on the decision function learned by the model. Indeed, in practice, models do much worse under these evaluations than under traditional IID ones (at least for many tasks). This is expected, since pretty much all models these days are trained using empirical risk minimization (i.e., under the IID assumption). GPT-3’s few-shot learning setting is partly interesting because it does not use the IID assumption, instead using a language modeling assumption (which lets it leverage a lot of what it learned, but may impose other constraints on its output).
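To make the contrast-set idea concrete, here’s a rough sketch of how the scoring works (the model interface and examples below are hypothetical, but the “consistency” metric, where the model gets credit only if it answers the original example and every perturbed variant correctly, is the one proposed in the Contrast Sets paper):

```python
def contrast_set_consistency(model, contrast_sets):
    """contrast_sets: a list of groups, each holding an original
    (input, gold_label) pair plus its manually perturbed variants.
    A group counts only if the model gets *every* member right,
    probing the local shape of the decision function rather than
    a single point on it."""
    consistent = sum(
        all(model.predict(x) == y for x, y in group)
        for group in contrast_sets
    )
    return consistent / len(contrast_sets)

# Hypothetical NLI item with two label-flipping perturbations.
contrast_sets = [[
    ("A man plays a guitar. -> A person plays music.", "entailment"),
    ("A man plays a guitar. -> A person plays the drums.", "contradiction"),
    ("A man plays a guitar. -> A man plays music for money.", "neutral"),
]]

class AlwaysEntailment:
    # Trivial stand-in model: right on the original, wrong on variants.
    def predict(self, x):
        return "entailment"

print(contrast_set_consistency(AlwaysEntailment(), contrast_sets))  # 0.0
```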
But still, ANLI is an example of an approach to adversarial evaluation, and it sits far at the bottom of your graphs in terms of GPT-3’s performance. Also notice that running GPT-3 on ANLI is not technically an adversarial evaluation; the full evaluation process for ANLI would involve humans searching for examples that GPT-3 gets wrong, evaluating on those, feeding them back in for training, and so on. Even the modest uptick at the end for GPT-3 might disappear under a true adversarial evaluation.
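For comparison, here is a schematic of the full human-and-model-in-the-loop protocol (all functions below are hypothetical stubs standing in for rounds of crowdworker collection, label verification, and retraining):

```python
def humans_write_fooling_examples(model):
    # Stand-in: annotators craft items the current model gets wrong,
    # and other annotators verify the intended gold labels.
    return [("premise -> hypothesis", "contradiction")]

def retrain(model, new_examples):
    # Stand-in: fold the verified model failures back into training.
    return model

def adversarial_rounds(model, n_rounds=3):
    for _ in range(n_rounds):
        failures = humans_write_fooling_examples(model)
        # By construction, this model scores ~0 on `failures`;
        # the interesting number is the *next* round's model.
        model = retrain(model, failures)
    return model
```

Evaluating a fixed model on a previous round’s test set (which is what running GPT-3 on ANLI amounts to) is a much weaker test than this moving target.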
So all that is to say: I would also take the relationship between human-level performance on a benchmark and human-equivalent performance on a task with a big grain of salt. Most of our datasets are really bad at assessing human equivalence in the high-performance regime, and our models only recently got good enough for this to become a problem (one which is now the subject of a lot of attention in the field). This is much less of an issue when you’re using supervised ML for some business purpose where your training and test sets are drawn IID from essentially the same distribution and your labels directly manifest your business need. But it’s a big problem for “general language understanding.”
Thanks! I agree that if we required GPT-N to beat humans on every benchmark question we could throw at it, then we would have set ourselves a much more difficult task.
I don’t think this matters much in practice, though, because humans and ML systems are designed really differently, so each is bound to be randomly better at some things and randomly worse at others. By the time ML is better than humans at all things, I think it will already be vastly better at most things. And I care more about the point when ML will first surpass humans at most things. This is most clearly true when considering all possible tasks (e.g., “when will AIs beat humans at surviving on a savannah in a band of hunter-gatherers?”), but I think it’s also true when considering questions of varying difficulty in a fairly narrow benchmark. Looking at the linked papers, I think contrast sets seem like a fair challenge; but I suspect that enough rounds of ANLI could yield questions that would be very rare in a normal setting [1].
To make that a little more precise, I want to answer the question “When will transformative AI be created?”. Exactly what group of AI technologies would or wouldn’t be transformative is an open question, but I think one good candidate is AI that can do the vast majority of economically important tasks more cheaply than a human. If I then adopt the horizon-length frame (which I find plausible but not clearly correct), the relevant question for GPT-N becomes “When will GPT-N be able to perform (for less cost than a human) the vast majority of all economically relevant sub-tasks with a 1-token horizon length?”
This is an annoyingly vague question, for sure. However, I currently suspect it’s more fruitful to approach it from the perspective of “How much reliability do we need for typical jobs? How expensive would it be to make GPT-N that reliable?” than from the perspective of “When will we be unable to generate questions that GPT-N fails at?”
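One back-of-the-envelope reason the reliability framing bites (the numbers here are purely illustrative, and assume sub-task errors are independent): per-step reliability compounds geometrically over a job.

```python
# Illustrative only: if a job decomposes into n independent sub-tasks,
# end-to-end success probability is p_step ** n.
for p_step in (0.95, 0.99, 0.999):
    for n in (10, 100, 1000):
        print(f"p_step={p_step}, n={n}: job success = {p_step ** n:.3f}")

# E.g., 0.99 ** 100 = 0.366: a model that is "99% reliable" per step
# finishes a 100-step job correctly about a third of the time.
```

So the gap between benchmark-level accuracy and job-level reliability can amount to several orders of magnitude in error rate.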
Another lens on this is to look at tasks that have metrics other than how well AI can imitate humans. Computers beat us at chess in the ’90s, but I think humans are still better in some situations, since human-AI teams do better than AIs alone. If we had evaluated chess engines on the metric of beating humans in every situation, we would have overestimated the point at which AIs beat us at chess by at least 20 years.
(Though in the case of GPT-N, this analogy is complicated by the fact that GPT-3 doesn’t have any training signal other than imitating humans.)
Though, as someone concerned about safety, I would be delighted if people became very serious about adversarial testing.
I guess my main concern here — besides everything I wrote in my reply to you below — is basically that the reliability of GPT-N on simple multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we’re to take SuperGLUE as representative, well… it’s already basically solved.
One of the problems here is that when the noise ceiling is set as low as it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren’t. It means you wouldn’t even trust a human to do this task if you really cared about the result. Coming up with tasks where humans can be reliable is actually quite difficult! And making humans reliable in the real world usually depends on their understanding the rules they are to follow and the business stakes involved in their decisions — much broader context that is very difficult to distill into artificial annotation tasks.
So when it comes to reliable automation, it’s not clear to me that just looking at human performance on difficult benchmarks is a reasonable indicator. You’d want to look at reliability on tasks with clear economic value, where the threshold of viability is clear. But faithfully distilling economically viable tasks into benchmarks is a huge part of the difficulty of automation in the first place. And I have a feeling that where you can do this successfully, you might find that the task either is already subject to automation, or doesn’t necessarily require huge advances in ML to become viable.