Thanks! I agree that if we required GPT-N to beat humans on every benchmark question that we could throw at them, then we would have a much more difficult task.
I don’t think this matters much in practice, though, because humans and ML are really differently designed, so we’re bound to be randomly better at some things and randomly worse at some things. By the time ML is better than humans at all things, I think they’ll already be vastly better at most things. And I care more about the point when ML will first surpass humans at most things. This is most clearly true when considering all possible tasks (e.g. “when will AIs beat humans at surviving on a savannah in a band of hunter-gatherers?”), but I think it’s also true when considering questions of varying difficulty in a fairly narrow benchmark. Looking at the linked papers, I think contrastive learning seems like a fair challenge; but I suspect that enough rounds of ANLI could yield questions that would be very rare in a normal setting [1].
To make that a little bit more precise, I want to answer the question “When will transformative AI be created?”. Exactly what group of AI technologies would or wouldn’t be transformative is an open question, but I think one good candidate is AI that can do the vast majority of economically important tasks more cheaply than a human. If I then adopt the horizon-length frame (which I find plausible but not clearly correct), the relevant question for GPT-N becomes “When will GPT-N be able to perform (for less cost than a human) the vast majority of all economically relevant sub-tasks with a 1-token horizon length?”
This is an annoyingly vague question, for sure. However, I currently suspect it’s more fruitful to think about this from the perspective of “How high reliability do we need for typical jobs? How expensive would it be to make GPT-N that reliable?” than to think about this from the perspective of “When will we be unable to generate questions that GPT-N fails at?”
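As a rough illustration of the reliability framing, here is a toy sketch of how per-step reliability compounds over a job made of many short sub-tasks. It assumes that errors on sub-tasks are independent and that a job only succeeds if every sub-task does; both are simplifying assumptions made for the example, not claims about real jobs. The function names and the numbers in the loop are hypothetical.

```python
def task_success_rate(step_reliability: float, steps: int) -> float:
    """Probability that all `steps` sub-tasks succeed.

    Assumption: sub-task errors are independent and any single error
    makes the whole task fail.
    """
    return step_reliability ** steps


def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed for a `steps`-step task to succeed
    with probability `target`, under the same independence assumption."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show how quickly errors compound.
    for steps in (10, 100, 1000):
        overall = task_success_rate(0.99, steps)
        needed = required_step_reliability(0.95, steps)
        print(f"{steps} steps: 99% per step gives {overall:.4f} overall; "
              f"95% overall needs {needed:.6f} per step")
```

Under these assumptions, the per-step reliability a job demands rises quickly with the number of sub-tasks, which is why the question of how expensive it would be to make GPT-N that reliable seems like the more informative one to ask.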
Another lens on this is to look at tasks that have metrics other than how well AI can imitate humans. Computers beat us at chess in the 90s, but I think humans are still better in some situations, since human-AI teams do better than AIs alone. If we had evaluated chess engines on the metric of beating humans in every situation, we would have overestimated the point at which AIs beat us at chess by at least 20 years.
(Though in the case of GPT-N, this analogy is complicated by the fact that GPT-3 doesn’t have any training signal other than imitating humans.)
[1] Though, being concerned about safety, I would be delighted if people became very serious about adversarial testing.
I guess my main concern here, besides everything I wrote in my reply to you below, is that the reliability of GPT-N on simple multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we’re to take SuperGLUE as representative, well... it’s already basically solved.
One of the problems here is that when you have the noise ceiling set so low, like it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren’t. It means you wouldn’t even trust a human to do this task if you really cared about the result. Coming up with tasks where humans can be reliable is actually quite difficult! And making humans reliable in the real world usually depends on them having an understanding of the rules they are to follow and the business stakes involved in their decisions — much broader context that is very difficult to distill into artificial annotation tasks.
So when it comes to reliable automation, it’s not clear to me that just looking at human performance on difficult benchmarks is a reasonable indicator. You’d want to look at reliability on tasks with clear economic viability, where the threshold of viability is clear. But the process of faithfully distilling economically viable tasks into benchmarks is a huge part of the difficulty in automation in the first place. And I have a feeling that where you can do this successfully, you might find that the task is either already subject to automation, or doesn’t necessarily require huge advances in ML in order to become viable.