You joke, but one of my main points is that these are very, very different things. Any benchmark, or dataset, acts as a proxy for the underlying task that we care about. Turing used natural conversation because it was a domain where a wide range of capabilities are normally used by humans. The problem is that in operationalizing the test (e.g., trying to fool a human), it ends up being possible or easy to pass without necessarily using or requiring all of those capabilities. And this can happen for reasons beyond just overfitting to the data distribution, because the test itself may just not be sensitive enough to capture “human-likeness” beyond a certain threshold (i.e., the noise ceiling).
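To make the noise-ceiling point a bit more concrete, here's a minimal, purely illustrative sketch. Everything in it (the data, the field names, and the use of single-annotator accuracy as the ceiling estimate) is my own toy assumption, not any particular benchmark's methodology:

```python
# Toy sketch: estimating a benchmark's noise ceiling from raw annotator votes.
# All data and field names here are invented for illustration.

items = [
    {"gold": "yes", "votes": ["yes", "yes", "yes"]},
    {"gold": "no",  "votes": ["no", "yes", "no"]},
    {"gold": "yes", "votes": ["yes", "no", "no"]},  # noisy item
]

def single_annotator_accuracy(items):
    """Average accuracy of an individual annotator against the gold label.

    One simple proxy for the noise ceiling: once a model scores above this,
    the benchmark can no longer tell whether further gains reflect real
    capability or just fitting the quirks of the annotation process.
    """
    correct = sum(v == item["gold"] for item in items for v in item["votes"])
    total = sum(len(item["votes"]) for item in items)
    return correct / total

print(f"Estimated noise ceiling: {single_annotator_accuracy(items):.2f}")
```

The point is just that once a model's score crosses that estimate, the benchmark has little left to say about how "human-like" the model is; the remaining gap is mostly annotation noise.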
And then I’m making the stronger claim that a lot of tasks (e.g. many personal assistant tasks, or natural language interfaces to decent APIs) can be automated via questions that are about as hard as the benchmark questions; i.e., that you don’t need more than the level of understanding signalled by beating a benchmark suite (as long as the model hasn’t been optimised for that benchmark suite).
What I’m saying is I really do not think that’s true. In my experience, at least one of the following holds for pretty much every NLP benchmark out there:
1. The data is likely artificially easy compared to what would be demanded of a model in real-world settings. (It’s hard to know this for sure for any dataset until the benchmark is beaten by non-robust models; but I basically assume it as a rule of thumb for things that aren’t specifically using adversarial methods.) Most QA and Reading Comprehension datasets fall into this category.
2. The annotation spec is unclear enough, or the human annotations are noisy enough, that even human performance on the task falls short of the reliability needed for practical automation tasks that use it as a subroutine (errors compound: a component that is right, say, 90% of the time, called a handful of times per task, drags end-to-end reliability well below that), except in cases that are relatively tolerant of incorrect outputs, like information retrieval and QA in search. This is partly because humans do these annotations in isolation, without a practical usage or business context to align their judgments. RTE, WiC, and probably MultiRC and BoolQ fall into this category.
3. For the datasets with hard examples and high agreement, the task is artificial and basic enough that operationalizing it into something economically useful remains a significant challenge. The WSC, CommitmentBank, BoolQ and COPA datasets fall into this category. (Side note: these actually aren’t even necessarily inherently less noisy or better specified than the datasets in the second category, because high agreement is often achieved just by filtering the low-agreement inputs out of the data; of course, those low-agreement inputs may be out there in the wild, and the correct output on them may rightly be considered underspecified.)
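To illustrate that side note, here’s a toy sketch of the filtering effect. All of the numbers, field names, and the 4-of-5 agreement threshold are invented for illustration; real datasets differ in exactly how they filter:

```python
# Toy sketch: how filtering on annotator agreement makes released data look
# much less noisy than the raw distribution. All numbers here are invented.
from collections import Counter

raw_items = [
    {"votes": ["yes", "yes", "yes", "yes", "yes"]},  # clear-cut
    {"votes": ["no", "no", "no", "no", "yes"]},      # mostly clear
    {"votes": ["yes", "yes", "no", "no", "yes"]},    # genuinely ambiguous
    {"votes": ["yes", "no", "yes", "no", "no"]},     # genuinely ambiguous
]

def agreement(item):
    """Fraction of annotators who chose the majority label."""
    counts = Counter(item["votes"])
    return counts.most_common(1)[0][1] / len(item["votes"])

def mean_agreement(items):
    return sum(agreement(i) for i in items) / len(items)

# Keep only items where at least 4 of 5 annotators agree (an invented
# threshold; the exact filter varies by dataset).
released = [item for item in raw_items if agreement(item) >= 0.8]

print(f"Mean agreement, raw data:      {mean_agreement(raw_items):.2f}")
print(f"Mean agreement, released data: {mean_agreement(released):.2f}")
# The released set reports high agreement, but the ambiguous inputs it
# dropped still show up in the wild, where the "correct" output may simply
# be underspecified.
```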
Possible exceptions to this general pattern are benchmarks that already correspond very closely to a business need. Examples include Quora Question Pairs (QQP) from GLUE and BoolQ from SuperGLUE. On QQP, models were already superhuman at GLUE’s release time (we hadn’t calculated human performance at that point, oops). BoolQ is also potentially special in that it was actually annotated over search queries, and even with low human agreement, progress on that dataset is probably somewhat representative of progress on extractive QA with yes/no questions in the Google search context (especially because there is a high tolerance for failure in that setting).
I think it would be really cool if we could just automate a bunch of complex tasks like customer service using something like natural language questions of the kind that appear in these benchmarks. In fact, using QA as a proxy for other kinds of structure and tasks is a primary focus of my own research. I think it might even be possible to make significant headway using this approach. But my sense is that the crucial last 10% of building a robust, usable system that can actually replace most of the value provided by a human in such a role requires a lot of software architecting, knowledge engineering, and domain-specific understanding in order to reliably align even a very powerful ML system with business goals. This is my understanding based on my own (brief, unsuccessful) work on natural language interfaces as well as discussions with people who work on comparatively successful projects like Google Assistant or Alexa chatbots. I’m not saying these things won’t get huge productivity boosts from automation very soon; rather, I suspect they will, and my impression is that startups like ASAPP and Cresta are making headway. Are recent big advances in ML a big part of this? Well, I don’t know, but I imagine so. A key enabling technology, even. But the people at these companies are probably not just working on reinforcement learning algorithms. I suspect they are instead working on advances of the kind we see in Semantic Machines’s dataflow graphs, or Tesla’s massive self-driving software stack.
And yeah, as ML advances, both SuperGLUE performance and productivity growth due to economically useful ML systems will continue. But beyond that, if some kind of phase shift in this productivity growth is expected (I admit I don’t really know what is meant by “transformative AI”), I don’t see a relationship between that and, e.g., SuperGLUE performance, any more than with performance on some older benchmark like the Penn Treebank. Benchmarks are beings of their time; they encode our best attempts at measuring progress, and their saturation serves primarily to guide us in our next steps.