I guess my main concern here is — besides everything I wrote in my reply to you below — basically that reliability of GPT-N on simple, multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we’re to take SuperGLUE as representative, well.. it’s already basically solved.
One of the problems here is that when you have the noise ceiling set so low, like it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren’t. It means you wouldn’t even trust a human to do this task if you really cared about the result. Coming up with tasks where humans can be reliable is actually quite difficult! And making humans reliable in the real world usually depends on them having an understanding of the rules they are to follow and the business stakes involved in their decisions — much broader context that is very difficult to distill into artificial annotation tasks.
So when it comes to reliable automation, it’s not clear to me that just looking at human performance on difficult benchmarks is a reasonable indicator. You’d want to look at reliability on tasks with clear economic viability, where the threshold of viability is clear. But the process of faithfully distilling economically viable tasks into benchmarks is a huge part of the difficulty in automation in the first place. And I have a feeling that where you can do this successfully, you might find that the task is either already subject to automation, or doesn’t necessarily require huge advances in ML in order to become viable.
I guess my main concern here is — besides everything I wrote in my reply to you below — basically that reliability of GPT-N on simple, multiclass classification tasks lacking broader context may not be representative of its reliability in real-world automation settings. If we’re to take SuperGLUE as representative, well.. it’s already basically solved.
One of the problems here is that when you have the noise ceiling set so low, like it is in SuperGLUE, reaching human performance does not mean the model is reliable. It means the humans aren’t. It means you wouldn’t even trust a human to do this task if you really cared about the result. Coming up with tasks where humans can be reliable is actually quite difficult! And making humans reliable in the real world usually depends on them having an understanding of the rules they are to follow and the business stakes involved in their decisions — much broader context that is very difficult to distill into artificial annotation tasks.
So when it comes to reliable automation, it’s not clear to me that just looking at human performance on difficult benchmarks is a reasonable indicator. You’d want to look at reliability on tasks with clear economic viability, where the threshold of viability is clear. But the process of faithfully distilling economically viable tasks into benchmarks is a huge part of the difficulty in automation in the first place. And I have a feeling that where you can do this successfully, you might find that the task is either already subject to automation, or doesn’t necessarily require huge advances in ML in order to become viable.