I’m not saying that all benchmarks are necessarily hard; I’m saying that these ones look pretty hard to me (compared with ~ordinary conversation).
I’m not sure exactly what you mean here, but if you mean “holding an ordinary conversation with a human” as a task, my sense is that it is extremely hard to do right (much harder than, e.g., SuperGLUE). There’s a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it’s quite gameable. This is why the Winograd Schema Challenge was proposed, but even that and newly proposed versions of it have seen lots of progress recently — at the end of the day it turns out to be hard to write very difficult tests even in the WSC format, for all the reasons related to shallow heuristic learning etc.; the problem is that our subjective assessment of the difficulty of a dataset generally assumes the human means of solving it and the associated conceptual scaffolding, which is no constraint for an Alien God.
So to address the difference between a language model and a general-purpose few-shot learner:
In other words, I feel happier navigating with my personal sense of “How hard is this language task?” when we’re talking about few-shot learning than when we’re talking about finetuned models, because finetuned models can entirely get by with heuristics that only work on a single benchmark, while few-shot learners use sets of heuristics that cover all tasks they’re exposed to. The latter seem far more likely to generalise to new tasks of similar difficulty (whether they do it via reasoning or via statistics).
I agree that we should expect its solutions to be much more general. The question at issue is: how does it learn to generalize? It is basically impossible to fully specify a task with a small training set and brief description — especially if the training set is only a couple of items. With so few examples, generalization behavior is almost entirely a matter of inductive bias. In the case of humans, this inductive bias comes from social mental modeling: the entire process of embodied language learning for a human trains us to be amazing at figuring out what you mean from what you say. In the case of GPT’s few-shot learning, the inductive bias comes entirely from a language modeling assumption, that the desired task output can be approximated using language modeling probabilities prefixed with a task description and a few I/O examples.
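To make the “language modeling assumption” concrete, here is a minimal sketch of what few-shot classification amounts to under it. `lm_logprob` is a hypothetical stand-in for whatever scoring interface the language model exposes, not any particular API.

```python
# Minimal sketch of few-shot "learning" as pure language modeling.
# `lm_logprob` is a hypothetical function returning log p(continuation | prefix)
# under whatever language model you have.
from typing import Callable, List, Tuple

def few_shot_classify(
    lm_logprob: Callable[[str, str], float],
    task_description: str,
    examples: List[Tuple[str, str]],   # a handful of (input, output) pairs
    new_input: str,
    candidate_outputs: List[str],
) -> str:
    # The "training set" is just text prepended to the query.
    prompt = task_description + "\n"
    for x, y in examples:
        prompt += f"Input: {x}\nOutput: {y}\n"
    prompt += f"Input: {new_input}\nOutput:"
    # The prediction is whichever candidate the model finds most probable
    # as a continuation of that found-text-shaped prompt.
    return max(candidate_outputs, key=lambda y: lm_logprob(prompt, " " + y))
```

Nothing in this procedure knows what the task “really” is; the entire burden of generalization falls on the model’s prior over text, conditioned on however the prompt happens to be worded.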
This gets us an incredible amount of mileage. But why, and how far will it get us? Well, you suggest:
In the limit of training on humans using language, we would have a perfect model of the average human in the training set, which would surely be able to achieve human performance on all tasks (though it wouldn’t do much better).
I don’t think this makes sense. If the data was gathered by taking humans into rooms, giving them randomly sampled task instructions, a couple input/output examples, and setting them loose, then maybe what you say would be true in the limit in some strict sense. But I don’t think it makes sense here to say the “average human” is captured in the training set. A language model is trained on found text. There is a huge variety of processes that give rise to this text, and it is the task of the system to model those processes. When it is prompted with a task description and I/O examples, it can only produce predictions by making sense of that prompt as found text, implicitly assigning some kind of generative process to it. In some sense, you can view it as doing an extremely smart interpolation between the texts that it’s seen before.
(Maybe you disagree with this because you think big enough data would include basically all possible such situations. I don’t think that is really right morally speaking, because I think the relative frequency of rarer and trickier situations (or underlying cognitive factors to explain situations, or whatever) in the LM setting will probably drop off precipitously, to the point of providing only an infinitesimal signal for the model to learn certain kinds of nuanced things in training. If you were to control the data distribution to fix this, well then you’re not really doing language modeling but controlled massive multitask learning, and regardless, the following criticisms still apply.)
It is utterly remarkable how powerful this approach is and how well it works for “few-shot learning.” A lot of stuff can be learned from found text. But the approach has limits. It is incumbent on the person writing the prompt to figure out how to write it to coax the desired behavior out of the language model, whether the goal is to produce advice about dealing with bears or to avoid producing answers to nonsense questions. It is very easy to imagine how the language modeling assumption will produce reasonable and cool outputs for lots of tasks, but its output in corner cases might vary wildly depending on precise and unpredictable features of the instructions. The problem of crafting a “representative training set” has simply been replaced by another problem, of crafting “representative prompts.” It is very hard, at least for me, to imagine how the language modeling assumption will lead to the model behaving truly reliably in corner cases — when the standards get higher — without relying on the details of prompt selection in exactly the way that supervised models rely on training set construction. I have no reason to expect the language model to just know what I mean.
Another problem here, though, is we don’t know what we mean either. In the course of, erm, producing economic value, when we hit a case where the instructions are ambiguous, we leverage everything else which was not in the instructions — understanding of broader social goals, ethical standards, systems of accountability to which we’re subject, mental modeling of the rule writer, etc. — to decide how to handle it. GPT-3’s few-shot learning uses the language modeling assumption as a stand-in for this. It’s cool that they align in places — for example, the performance breakdown in the GPT-3 paper for the most part indicates that they align well on factual indexing and recall (which is where much of the gain was), but in general I would not expect them to be aligned, and I would expect the misalignment to be exposed to a greater degree as language models get better and standards get higher. Ultimately the language model serves as a very powerful starting point, but the question of how to reliably align it with an external goal remains open (or involves supervised learning, which leads us back to where we started but with a much beefier model). I find that the conception of GPT-3 as a general-purpose few-shot learner, as opposed to just a very powerful language model, tends to let all of these complexities get silently swept under the rug.
It’s worth dwelling on the “we don’t know what we mean” point a bit more, because the same point applies to supervised learning; indeed, it’s precisely the reason that we end up developing datasets that we think require human-like skills, but which actually can be gamed and solved without them. So you might think what I’m saying here could be applied to just about any model and any benchmark: that perhaps we can extrapolate performance trends on the benchmark, but reaching the point of saturation indicates less that the underlying problems are solved in a way that generalizes beyond the benchmark, and more that the misalignment between the benchmark and the underlying goal is now laid bare.
And… that is basically my argument. Unless you’re optimizing for the exact thing that you care about, Moloch will get your babies in the end. The thing is, most benchmarks were never intended for this kind of extrapolation (i.e., beyond the precise scope of the data). And the ones that were — for the most part, billed ambitiously as Grand Challenges of AI — have all been essentially debunked. The remaining reasonable use of a benchmark is to allow empirical comparisons between current systems, thereby facilitating progress in the best way we know how. When a benchmark becomes saturated, then we can examine why, and use the lessons to construct the next benchmark, and general progress is — ideally — steadily made.
Consider the Penn Treebank. This was one of the first big corpora of annotated English text used for NLP, constructed from Wall Street Journal newswire text written in the ’80s. It centered on English syntax, and a lot of work focused on modeling the Penn Treebank with the idea that if you could get a model to produce correct syntactic analyses, i.e., understand syntax, then it could serve as a solid foundation for higher levels of language understanding, just as we think of it happening in humans. These days, error rates for models on Penn Treebank syntax are below the estimated annotation error in the gold standard — i.e., the dataset is solved. But syntax is not solved, in any meaningful sense, and the reasons why extend far beyond the standard problems with shallow heuristics. The thing is, even if we have “accurate” syntax, we have no idea how to use it! All the model learned is a function to output syntax trees; it doesn’t automatically come with all of the other facilities that we think relate to syntactic processing in humans, like how different syntactic analyses might affect the meaning of a sentence. For that, you need more tasks and more models, and the syntax trees are demoted to the level of input features. In the old days, they were part of a pipeline that slowly built up semantic features out of syntactic ones. But nowadays you might as well just use a neural net and forget your linguistics. So what was the point? Well, we made a lot of modeling progress along the way, and learned a lot of lessons about how syntax trees don’t get you lots of the stuff you want.
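To make the “demoted to input features” point concrete, here’s a toy sketch using nltk (whose `Tree.fromstring` reads PTB-style bracketings); the sentence and the feature scheme are purely illustrative.

```python
# Toy illustration: a Penn Treebank-style parse, and how it typically ends
# up as "just features" for some downstream model. Requires nltk.
from collections import Counter
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (NN treebank)) (VP (VBZ is) (VP (VBN solved))))"
)

print(parse.pos())           # [('The', 'DT'), ('treebank', 'NN'), ...]
print(parse.productions())   # S -> NP VP, NP -> DT NN, ...

# Downstream, the structure is flattened into features (here, production
# counts). Nothing about what the structure *means* comes along for free;
# some other model has to be trained to consume these.
features = Counter(str(p) for p in parse.productions())
```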
These problems continue today. Some say BERT seems to “understand syntax” — since you can train a model to output syntactic trees from its representations with reasonable accuracy. But then when you look at how BERT-trained models actually make their decisions, they’re often not sensitive to syntax in the right ways. Just being able to predict something in one circumstance — even if it apparently generalizes out-of-domain — does not imply that the model contains a general facility for understanding that thing and using it in the ways that humans do. So this idea that you could have a model that just “generally understands language” and then throw it at some task in the world and expect it to succeed, where the model has no automatic mechanism to align with the world and the expectations on it, seems alien to me. All of our past experience with benchmarks, and all of my intuitions about how human labor works, seem to contradict it.
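For concreteness, the “probing” methodology I have in mind looks roughly like the sketch below: fit a small classifier to read a syntactic property off BERT’s hidden states. High probe accuracy shows the information is recoverable from the representations, not that the model’s own predictions are sensitive to it. This assumes the transformers, torch, and scikit-learn libraries, and a hypothetical `tagged_sentences` list of (words, POS tags) pairs.

```python
# Sketch of a syntactic probe over BERT representations.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def token_vectors(sentence: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[0]   # (num_wordpieces, hidden_dim)

# `tagged_sentences` is hypothetical: [(["The", "cat", "sat"], ["DT", "NN", "VBD"]), ...]
X, y = [], []
for words, tags in tagged_sentences:
    vecs = token_vectors(" ".join(words))
    # Glossing over wordpiece-to-word alignment, which a real probe handles.
    for vec, tag in zip(vecs[1:1 + len(words)], tags):
        X.append(vec.numpy())
        y.append(tag)

probe = LogisticRegression(max_iter=1000).fit(X, y)
```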
And that’s because, to reiterate an earlier argument: for us to know the exact thing we want and precisely characterize it is basically the condition for something being subject to automation by traditional software. ML can come into play where the results don’t really matter that much, with things like search/retrieval, ranking problems, etc., and ML can play a huge role in software systems, though the semantics of the ML components need to be carefully monitored and controlled to ensure consistent alignment with business goals. But it seems to me this is all basically already happening, and the core ML technology is already there to do a lot of this. Maybe you could argue that there is some kind of glide path from big LMs to transformative AI. But I have a really hard time seeing how benchmarks like SuperGLUE have bearing on it.
I’m not sure exactly what you mean here, but if you mean “holding an ordinary conversation with a human” as a task, my sense is that is extremely hard to do right (much harder than, e.g., SuperGLUE). There’s a reason that it was essentially proposed as a grand challenge of AI; in fact, it was abandoned once it was realized that actually it’s quite gameable.
“actually it’s quite gameable” = “actually it’s quite easy” ;)

More seriously, I agree that a full-blown Turing test is hard, but this is because the interrogator can choose whatever question is most difficult for a machine to answer. My statement about “ordinary conversation” was vague, but I was imagining something like sampling sentences from conversations between humans, and then asking questions about them, e.g. “What does this pronoun refer to?”, “Does this entail or contradict this other hypothesis?”, “What will they say next?”, “Are they happy or sad?”, “Are they asking for a cheeseburger?”.
For some of these questions, my original claim follows trivially. “What does this pronoun refer to?” is clearly easier for randomly chosen sentences than for Winograd sentences, because the latter have been selected for ambiguity.
And then I’m making the stronger claim that a lot of tasks (e.g. many personal assistant tasks, or natural language interfaces to decent APIs) can be automated via questions that are similarly hard to the benchmark questions; i.e., that you don’t need more than the level of understanding signalled by beating a benchmark suite (as long as the model hasn’t been optimised for that benchmark suite).
You joke, but one of my main points is that these are very, very different things. Any benchmark, or dataset, acts as a proxy for the underlying task that we care about. Turing used natural conversation because it was a domain where a wide range of capabilities are normally used by humans. The problem is that in operationalizing the test (e.g., trying to fool a human), it ends up being possible or easy to pass without necessarily using or requiring all of those capabilities. And this can happen for reasons beyond just overfitting to the data distribution, because the test itself may just not be sensitive enough to capture “human-likeness” beyond a certain threshold (i.e., the noise ceiling).
And then I’m making the stronger claim that a lot of tasks (e.g. many personal assistant tasks, or natural language interfaces to decent APIs) can be automated via questions that are similarly hard to the benchmark questions; i.e., that you don’t need more than the level of understanding signalled by beating a benchmark suite (as long as the model hasn’t been optimised for that benchmark suite).
What I’m saying is I really do not think that’s true. In my experience, at least one of the following holds for pretty much every NLP benchmark out there:
The data is likely artificially easy compared to what would be demanded of a model in real-world settings. (It’s hard to know this for sure for any dataset until the benchmark is beaten by non-robust models; but I basically assume it as a rule of thumb for things that aren’t specifically using adversarial methods.) Most QA and Reading Comprehension datasets fall into this category.
The annotation spec is unclear enough, or the human annotations are noisy enough, that even human performance on the task is at an insufficient reliability level for practical automation tasks which use it as a subroutine, except in cases which are relatively tolerant of incorrect outputs (like information retrieval and QA in search). This is in part because humans do these annotations in isolation, without a practical usage or business context to align their judgments. RTE, WiC, and probably MultiRC and BoolQ fall into this category.
For the datasets with hard examples and high agreement, the task is artificial and basic enough that operationalizing it into something economically useful remains a significant challenge. The WSC, CommitmentBank, BoolQ and COPA datasets fall in this category. (Side note: these actually aren’t even necessarily inherently less noisy or better specified than the datasets in the second category, because often high agreement is achieved by just filtering out the low-agreement inputs from the data; of course, these low-agreement inputs may be out there in the wild, and the correct output on those may rightly be considered underspecified.)
Possible exceptions to this arise when the benchmark already corresponds very closely to a business need. Examples of this include Quora Question Pairs (QQP) from GLUE and BoolQ on SuperGLUE. On QQP, models were already superhuman at GLUE’s release time (we hadn’t calculated human performance at that point, oops). BoolQ is also potentially special in that it was actually annotated over search queries, and even with low human agreement, progress on that dataset is probably somewhat representative of progress on extractive QA with yes/no questions in the Google search context (especially because there is a high tolerance for failure in that setting).
I think it would be really cool if we could just automate a bunch of complex tasks like customer service using something like natural language questions of the kind that appear in these benchmarks. In fact, using QA as a proxy for other kinds of structure and tasks is a primary focus of my own research. I think that it might even be possible to make significant headway using this approach. But my sense is that the crucial last 10% of building a robust, usable system that can actually replace most of the value provided by a human in such a role requires a lot of software architecting, knowledge engineering, and domain-specific understanding in order to have a way to reliably align even a very powerful ML system with business goals. This is my understanding based on my own (brief, unsuccessful) work on natural language interfaces as well as discussions with people who work on comparatively successful projects like Google Assistant or Alexa chatbots. I’m not saying that I don’t think these things will get huge productivity boosts from automation very soon — rather, I suspect they will, and my impression is that startups like ASAPP and Cresta are making headway. Are recent big advances in ML a big part of this? Well, I don’t know, but I imagine so. A key enabling technology, even. But the people at these companies are probably not just working on reinforcement learning algorithms. I would suspect that they instead are working on advances of the kind that we see in Semantic Machines’s dataflow graphs, or Tesla’s massive self-driving software stack.
And yeah, as ML advances, both SuperGLUE performance and productivity growth due to economically useful ML systems will continue. But beyond that, if some kind of phase shift in this productivity growth is expected (I admit I don’t really know what is meant by “transformative AI”), I don’t see a relationship between that and e.g. SuperGLUE performance, any more than performance on some other benchmark like the much older Penn Treebank. Benchmarks are beings of their time; they encode our best attempts at measuring progress, and their saturation serves primarily to guide us in our next steps.
Cool, thanks. I agree that specifying the problem won’t get solved by itself. In particular, I don’t think that any jobs will become automated by describing the task and giving 10 examples to an insanely powerful language model. I realise that I haven’t been entirely clear on this (and indeed, my intuitions about this are still in flux). Currently, my thinking goes along the following lines:
Fine-tuning on a representative dataset is really, really powerful, and it gets more powerful the narrower the task is. Since most benchmarks are more narrow than the things we want to automate, and it’s easier to game more narrow benchmarks, I don’t trust trends based on narrow, fine-tuned benchmarks that much.
However, in a few-shot setting, there’s not enough data to game the benchmarks in an overly narrow way. Instead, they can be fairly treated as a sample from all possible questions you could ask the model. If the model can answer some SuperGLUE questions that seem reasonably difficult, then my default assumption is that it could also answer other natural language questions that seem similarly difficult.
This isn’t always an accurate way of predicting performance, because of our poor abilities to understand what questions are easy or hard for language models.
However, it seems like it should at least be an unbiased prediction; I’m as likely to think that benchmark question A is harder than non-benchmark question B as I am to think that B is harder than A (for A, B that are in fact similarly hard for a language model).
However, when automating stuff in practice, there are two important problems that speak against using few-shot prompting:
As previously mentioned, tasks-to-be-automated are less narrow than the benchmarks. Prompting with examples seems less useful for less narrow situations, because each example may be much longer and/or you may need more examples in the prompt to cover the variation of situations.
Finetuning is in fact really powerful. You can probably automate stuff with finetuning long before you can automate it with few-shot prompting, and there’s no good reason to wait for models that can do the latter.
Thus, I expect that in practice, telling the model what to do will happen via finetuning (perhaps even in an RL fashion, directly from human feedback), and the purpose of the benchmarks is just to provide information about how capable the model is.
I realise this last step is very fuzzy, so to spell out a procedure somewhat more explicitly: When asking whether a task can be automated, I think you can ask something like “For each subtask, does it seem easier or harder than the ~solved benchmark tasks?” (optionally including knowledge about the precise nature of the benchmarks, e.g. that the model can generally figure out what an ambiguous pronoun refers to, or figure out if a stated hypothesis is entailed by a statement). Of course, a number of problems make this pretty difficult:
It assumes some way of dividing tasks into a number of sub-tasks (including the subtask of figuring out what subtask the model should currently be trying to answer).
Insofar as that which we’re trying to automate is “farther away” from the task of predicting internet corpora, we should adjust for how much finetuning we’ll need to make up for that.
We’ll need some sense of how 50 in-prompt examples showing the exact question-response format compare to 5000 (or more; or fewer) finetuning samples showing what to do in similar-but-not-exactly-the-same situations (a rough sketch of this comparison follows this list).
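To spell out what such an “exchange rate” measurement might look like, here’s a rough sketch; `eval_few_shot_with_k_prompt_examples` and `eval_finetuned_with_n_examples` are hypothetical experiment functions, not anything that exists.

```python
# Rough framing of the "exchange rate" question: for a fixed target accuracy,
# how many examples does each regime need? The two eval functions are
# hypothetical stand-ins for real experiments.
from typing import Callable

def examples_needed(eval_at: Callable[[int], float],
                    target_acc: float, budget: int) -> int:
    """Smallest n at which eval_at(n) reaches target_acc, or -1 within budget."""
    for n in range(1, budget + 1):
        if eval_at(n) >= target_acc:
            return n
    return -1

# k = examples_needed(eval_few_shot_with_k_prompt_examples, 0.90, budget=100)
# n = examples_needed(eval_finetuned_with_n_examples, 0.90, budget=10_000)
# The ratio n / k at matched accuracy is one way to cash out the comparison.
```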
Nevertheless, I have a pretty clear sense that if someone told me “We’ll reach near-optimal performance on benchmark X with <100 examples in 2022” I would update differently on ML progress than if they told me the same thing would happen in 2032; and if I learned this about dozens of benchmarks, the update would be non-trivial. This isn’t about “benchmarks” in particular, either. The completion of any task gives some evidence about the probability that a model can complete another task. Benchmarks are just the things that people spend their time recording progress on, so it’s a convenient list of tasks to look at.
for us to know the exact thing we want and precisely characterize it is basically the condition for something being subject to automation by traditional software. ML can come into play where the results don’t really matter that much, with things like search/retrieval, ranking problems,
I’m not sure what you’re trying to say here? My naive interpretation is that we only use ML when we can’t be bothered to write a traditional solution, but I don’t think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
My take is that for us to know the exact thing we want and precisely characterize it is indeed the condition for writing traditional software; but for ML, it’s sufficient that we can recognise the exact thing that we want. There are many problems where we recognise success without having any idea about the actual steps needed to perform the task. Of course, we also need a model with sufficient capacity, and a dataset with sufficiently many examples of this task (or an environment where such a dataset can be produced on the fly, RL-style).
Re: how to update based on benchmark progress in general, see my response to you above.

On the rest, the best way I can think of to explain this is in terms of alignment and not correctness.
My naive interpretation is that we only use ML when we can’t be bothered to write a traditional solution, but I don’t think you believe that. (To take a trivial example: ML can recognise birds far better than any software we can write.)
The bird example is good. My contention is basically that when it comes to making something like “recognizing birds” economically useful, there is an enormous chasm between 90% performance on a subset of ImageNet and money in the bank. For two reasons, among others:
Alignment. What do we mean by “recognize birds”? Do pictures of birds count? Cartoon birds? Do we need to identify individual organisms e.g. for counting birds? Are some kinds of birds excluded?
Engineering. Now that you have a module which can take in an image and output whether it has a bird in it, how do you produce value?
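To picture what that engineering looks like, here’s a toy sketch: the ML module does the recognition, while the surrounding code encodes the business’s answers to the contingency questions. Everything here is hypothetical (`ml_bird_score`, `send_to_human_review`, the thresholds); it’s a sketch of the shape of the system, not a real pipeline.

```python
# Toy sketch: the ML module does recognition; the surrounding code encodes
# the business's answers to the contingency questions (what counts as a
# bird? what happens when the model is unsure?). All names and thresholds
# here are hypothetical.
from dataclasses import dataclass

@dataclass
class BirdDecision:
    is_bird: bool
    source: str   # "model" or "human"

def classify_for_pipeline(image, ml_bird_score, send_to_human_review,
                          accept_above=0.95, reject_below=0.05) -> BirdDecision:
    score = ml_bird_score(image)
    # Act automatically only when the model is very confident; route
    # everything else (the unexpected contingencies) to a person.
    if score >= accept_above:
        return BirdDecision(is_bird=True, source="model")
    if score <= reject_below:
        return BirdDecision(is_bird=False, source="model")
    return BirdDecision(is_bird=send_to_human_review(image), source="human")
```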
I’ll admit that this might seem easy to do, and that ML is doing pretty much all the heavy lifting here. But my take is that that’s because object recognition/classification is a very low-level, automatic, sub-cognitive thing. Once you start getting into questions of scene understanding, or indeed language understanding, there is an explosion of contingencies beyond silly things like cartoon birds. What humans are really, really good at is understanding these (often unexpected) contingencies in the context of their job and their business’s needs, and acting appropriately. At what point would you be willing to entrust an ML system to deal with entirely unexpected contingencies in a way that suits your business needs (and indeed, doesn’t tank them)? Even the highest level of robustness on known contingencies may not be enough, because almost certainly, the problem is fundamentally underspecified from the instructions and input data. And so, in order to successfully automate the task, you need to successfully characterize the full space of contingencies you want the worker to deal with, perhaps enforcing it by the architecture of your app or business model. And this is where the design, software engineering, and domain-specific understanding aspects come in. Because no matter how powerful our ML systems are, we only want to use them if they’re aligned (or if, say, we have some kind of bound on how pathologically they may behave, or whatever), and knowing that is in general very hard. More powerful ML does make the construction of such systems easier, but it is in some way orthogonal to the alignment problem. I would make this more concrete, but I’m tired, so I hope the concrete examples I gave earlier in the discussion serve as inspiration enough.
And, yeah. I should also clarify that my position is in some way contingent on ML already being good enough to eat all kinds of stuff that 10 years ago would have been unheard of. I don’t mean to dunk on ML’s economic value. But basically what I think is that a lot of pretty transformative AI is already here. The market has taken up a lot of it, but I’m sure there’s plenty more slack to be made up in terms of productivity gains from today’s (and yesterday’s) ML. This very well might result in a doubling of worker productivity, which we’ve seen many times before and which seems to meet some definition of “producing the majority of the economic value that a human is capable of.” Maybe if I had a better sense of the vision of “transformative AI” I would be able to see more clearly how ML progress relates to it. But again, even then I don’t necessarily see the connection to extrapolation on benchmarks, which are inherently just measuring sticks of their day and kind of separate from the economic questions.
Anyway, thanks for engaging. I’m probably going to duck out of responding further because of holiday and other duties, but I’ve enjoyed this exchange. It’s been a good opportunity to express, refine, & be challenged on my views. I hope you’ve felt that it’s productive as well.
This has definitely been productive for me. I’ve gained useful information, I see some things more clearly, and I’ve noticed some questions I still need to think a lot more about. Thanks for taking the time, and happy holidays!