I think you were pretty clear on your thoughts, actually. The easy, low-level response to some of your skeptical thoughts is technical detail, so I’m going to start with that and then follow it with a higher-level, more conceptual response.
The source of a lot of my skepticism is GPT-3’s inherent inconsistency. It can range wildly from its high-quality output to gibberish, repetition, regurgitation, etc. If it did have some reasoning process, I wouldn’t expect such inconsistency. Even when it is performing so well that people call it “reasoning”, it has enough artifacts of its “non-reasoning” output to make me skeptical (logical contradictions, its tendency to repeat itself, e.g. “Because Gravity Duh” like in the OP, etc.).
So, the way GPT-3 generates text involves random sampling. The model produces a distribution, a list of words ranked by likelihood, and then the sampling algorithm picks a word, appends it to the prompt, and feeds the result back to the model as the new prompt. It can’t go back and edit things. The model itself, the way the distribution is produced, and the sampling method are all distinct things. People have come up with better sampling methods, like nucleus sampling, repetition penalties, or minimal unlikelihood sampling, but OpenAI is trying to prove a point about scaling, so they only implemented a few of those features in the beta roll-out.
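To make the “model versus sampler” split concrete, here is a minimal sketch of the generation loop. The toy lookup table stands in for the model and is entirely made up for illustration; nothing here is GPT-3’s actual code.

```python
import random

# Hypothetical toy "model": maps the last word to a distribution over next words.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"the": 1.0},
    "ran": {"the": 1.0},
}


def next_word_distribution(tokens):
    """Stand-in for the model: returns {word: probability} given the text so far."""
    return TOY_MODEL[tokens[-1]]


def sample(distribution, rng):
    """Stand-in for the sampling method: draw one word in proportion to its probability."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(prompt, n_words, seed=0):
    """Repeatedly ask the model for a distribution, sample a word, and append it."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_words):
        dist = next_word_distribution(tokens)  # model: text -> distribution
        tokens.append(sample(dist, rng))       # sampler: distribution -> one word
        # Nothing in this loop revisits or edits earlier tokens.
    return tokens


print(" ".join(generate(["the"], 8)))
```

Swapping out `sample` changes the sampling method without touching the model, which is the sense in which the two are distinct.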
It still works surprisingly well for two reasons: (1) the sampling method uses top-k, which limits the candidate tokens to, say, the 40 most likely continuations, so we don’t get nonsensical gibberish very often; and (2) it’s random. That is, it selects words with a 5% chance in the distribution 5% of the time, and words with an 80% chance 80% of the time, with higher temperature skewing towards less likely words and lower temperature skewing towards more likely words. So we get stuff that makes sense (because contradictions are weighted as less likely) while still being full of flavour.
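Here is a rough sketch of top-k sampling with temperature, assuming the model hands back a raw score (logit) per token. The function names, the toy scores, and the default of k=40 are illustrative assumptions, not OpenAI’s implementation.

```python
import math
import random


def top_k_temperature_probs(logits, k=40, temperature=1.0):
    """Return {token: probability} after top-k truncation and temperature scaling."""
    # Keep only the k highest-scoring tokens; everything else gets zero chance.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Divide scores by the temperature: below 1 sharpens, above 1 flattens.
    scaled = [(tok, score / temperature) for tok, score in top]
    # Softmax over the survivors (subtracting the max for numerical stability).
    highest = max(score for _, score in scaled)
    exps = [(tok, math.exp(score - highest)) for tok, score in scaled]
    total = sum(e for _, e in exps)
    return {tok: e / total for tok, e in exps}


def sample_token(logits, k=40, temperature=1.0, rng=random):
    """Draw one token in proportion to its truncated, rescaled probability."""
    probs = top_k_temperature_probs(logits, k, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up scores for continuations of "The apple ..."
toy_logits = {"falls": 3.0, "floats": 1.0, "sings": -2.0, "zxqv": -8.0}
for temp in (0.5, 1.0, 2.0):
    print(temp, top_k_temperature_probs(toy_logits, k=3, temperature=temp))
```

With k=3 the junk token is never a candidate, and raising the temperature moves probability from “falls” towards the less likely survivors.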
But for the same reasons that it works so well, that algorithm also produces the artifacts you’re talking about. “Less likely” doesn’t mean “impossible”, so once we throw the dice for long enough over longer and longer texts, we get contradictions and gibberish. While extreme repetition isn’t likely in human language, once it occurs a few times in a row by chance, the model (correctly) weights it as more and more likely until it gets stuck in a loop. And even after all of that, the model itself is trained largely on CommonCrawl, which contains a lot of contradiction and nonsense. Suppose I asked someone to listen to six hundred hours of children’s piano recitals, prompted them with a D flat note, and told them to accurately mimic the distribution of skill they heard in the recitals. Sometimes they would give me an amazing performance, since there would be a few highly skilled or gifted kids in the mix, but most of the time it would be mediocre, and some of the time atrocious. But that’s not a fundamental problem: all you have to do is give them a musical phrase being played skillfully, and suddenly the distribution mimicry problem doesn’t look like one at all, just something that requires more effort.
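A back-of-the-envelope sketch of why longer texts derail more often. It assumes each token independently has some small chance of being a “derailing” pick, which is a simplification (real draws are not independent), and the 0.1% figure is invented purely for illustration.

```python
def prob_at_least_one_derailment(p_bad_per_token, n_tokens):
    """Chance that at least one 'bad' token is drawn in n independent draws."""
    return 1.0 - (1.0 - p_bad_per_token) ** n_tokens


# Hypothetical 0.1% chance per token of a pick that sends the text off the rails.
for n in (20, 200, 2000):
    print(n, round(prob_at_least_one_derailment(0.001, n), 3))
```

Under those assumptions the chance of at least one bad draw is roughly 2% over 20 tokens, 18% over 200, and 86% over 2000, even though every individual draw is 99.9% safe.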
Once the underlying architecture is clear, you really need to go into the finer details of what it means to be “capable” of reasoning. If I have a box that spits out long strings of gibberish half the time and well-formed original arguments the other half, is it capable of reasoning? What if the well-formed arguments only come out ten percent of the time? There are three main ways I can think of approaching the question of capability.
In the practical and functional sense, in situations where reliability matters: if I have a “driverless car” which selects actions like steering and braking from a random distribution when travelling to a destination, and as a result crashes into storefronts or takes me into the ocean, I would not call that “capable of driving autonomously”. From this perspective, GPT-3 with top-k sampling is not capable of reliably reasoning as it stands. But suppose it turned out that there was a road model producing the distribution, that the road model was actually really good but the sampling method was bad, and that all I needed was a better sampling method… Likewise, with GPT-3, if you were looking directly at the distribution, and only cared about it generating 10-20 words at a time, it would be very easy to make it perform reasoning tasks. But for other tasks? Top-k isn’t amazing, but the other sampling methods aren’t much better. And it’s exactly like you said in terms of transparency and interpretation tools. We don’t know where to start, whether there’s even a one-size-fits-all solution, or what the upper limits are of the useful information we could extract from the underlying model. (I know, for instance, that word embeddings trained on millions of materials science abstracts, Word2vec rather than BERT if I’m reading the linked paper right, could be used to predict thermoelectric materials: https://perssongroup.lbl.gov/papers/dagdelen-2019-word-embeddings.pdf. What’s buried within GPT-3?) So I’d definitely say “no” for this sense of the word capable.
In the literal sense: if GPT-3 can demonstrate reasoning once (we already know it can handle Boolean logic, maths, and deductive, inductive, and analogical word problems, among others), then it’s “capable” of reasoning.
In the probabilistic sense: language has a huge probability-space. GPT-3 has roughly 50,000 tokens to select from (50,257 in its byte-pair vocabulary), every single time it writes a word. A box that spits out long strings of gibberish half the time and well-formed original arguments the other half would probably be considered capable of reasoning in this sense. “Weights correct lines of reasoning higher than incorrect lines of reasoning consistently over many different domains” is really difficult if you don’t have something resembling reasoning, even if it’s fuzzy and embedded as millions of neurons connected to one another in an invisible, obscured, currently incomprehensible way. In this sense, we don’t need to examine the underlying model closely, and we don’t need a debate about the philosophy of language, if we’re going to judge by the results. And the thing is, we already know GPT-3 does this, despite being hampered by sampling.
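To put a rough number on “huge”, here is a small sketch that counts token sequences in log space. The round vocabulary of 50,000, the 100-token length, and the deliberately generous guess of 10^100 “acceptable” sequences are all assumptions made for illustration.

```python
import math


def log10_num_sequences(vocab_size, length):
    """log10 of the number of distinct token sequences of the given length."""
    return length * math.log10(vocab_size)


def log10_chance_uniform_hit(n_acceptable, vocab_size, length):
    """log10 of the chance that a uniformly random sequence is one of n_acceptable."""
    return math.log10(n_acceptable) - log10_num_sequences(vocab_size, length)


total = log10_num_sequences(50_000, 100)
chance = log10_chance_uniform_hit(10**100, 50_000, 100)
print(f"about 10^{total:.0f} possible 100-token sequences")
print(f"uniform guessing hits an acceptable one with chance about 10^{chance:.0f}")
```

Even granting 10^100 acceptable answers, blind guessing succeeds with a chance of around 10^-370, so a box that gets it right half the time is concentrating probability mass to an extraordinary degree.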
Now, there’s the final point I want to make architecture-wise. I’ve seen this brought up a lot in this thread: what if the CommonCrawl dataset has a question asking about clouds becoming lead, or a boy who catches on fire if he turns five degrees left? The issue is that even if those examples existed (I was only able to find something very vaguely related to the cloud-lead question on Stack Exchange’s worldbuilding forum), GPT-3, though it can do better than its predecessor, can’t memorise or remember all of its training dataset. In a way, that’s the entire point: compression is learning. Having a good representation of a dataset means being able to compress and decompress it more accurately and to a greater extent. If you had a model that just memorised everything, it wouldn’t be able to do any of the things we’ve seen it do. This is an issue of anthropomorphising: GPT-3 doesn’t “read”, it passes over about 570GB of filtered text and updates its weights incrementally as it goes. The appearance of a single question asking about clouds turning into lead isn’t a drop in the bucket, proportionally; it’s a drop in the ocean. If a poem appears 600 times, that’s another story. But right now the “what if it was on the internet, somewhere?” thing doesn’t really make any sense, and every time we give GPT-3 another, even more absurd and specific problem, it makes even less sense, given that there’s a much simpler alternative hypothesis: that a 175 billion parameter transformer trained at the cost of $6.5m on much of the internet, in order to model sequences of text as accurately as possible, also needed to develop a rudimentary model of the logical reasoning, concepts, and causes and effects that went into those sequences of text.
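For a sense of scale on “drop in the ocean”, here is a quick calculation. The 300-byte size for a single forum question is a guess of mine, and the corpus is taken as 570GB in decimal gigabytes.

```python
def fraction_of_corpus(snippet_bytes, corpus_bytes=570e9):
    """Share of the training text taken up by a snippet of the given size."""
    return snippet_bytes / corpus_bytes


# Hypothetical 300-byte question appearing once, versus a 300-byte text repeated 600 times.
single = fraction_of_corpus(300)
repeated = fraction_of_corpus(300 * 600)
print(f"one occurrence: {single:.1e} of the corpus")
print(f"600 occurrences: {repeated:.1e} of the corpus")
```

Under those assumptions a single occurrence is on the order of one part in two billion of the training text, and 600 repeats are still well under a millionth of it, though 600 times more present than the one-off.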
So I’ve done the low-level technical response (which might sum up to: “in the literal and probabilistic senses, and kind of in the practical sense, GPT-3 has been able to perform reasoning on everything we’ve thrown at it so far”) and pretty much emptied out my head, so here’s what’s left:
With regards to the original question I posed, I guess the natural response is to just balk at the idea of answering it, but the point isn’t really to answer it. The point is that it sparks the process of conceptually disambiguating “pattern-matching” and “reasoning” with a battery of concrete examples, and then arriving at the conclusion that very, very good pattern-matching and reasoning aren’t distinct things, or at least aren’t distinct enough to really matter in a discussion about AI. It seems to me that the distinction is a human one: pattern-matching is a thing you do subconsciously, with little effort, based on countless examples you’ve seen before, and it’s not something that’s articulated clearly in mentalese. It’s also usually domain-specific: think doctors, lawyers, managers, chess players, and so on. Reasoning is a thing you do consciously, that takes a lot of effort, that can be articulated clearly, and that you apply to unfamiliar subject matter you haven’t seen enough of to pattern-match. That distinction seems to me to be specific to our neural architecture, which can only automatise high-level thoughts given enough exposure and time; it seems less meaningful for something as alien as a transformer model.
One thing I find impressive about GPT-3 is that it’s not even trying to generate text.
Imagine that someone gave you a snippet of random internet text, and told you to predict the next word. You give a probability distribution over possible next words. The end.
Then, your twin brother gets a snippet of random internet text, and is told to predict the next word. Etc. Unbeknownst to either of you, the text your brother gets is the text you got, with a new word added to it according to the probability distribution you predicted.
Then we repeat with your triplet brother, then your quadruplet brother, and so on.
Is it any wonder that sometimes the result doesn’t make sense? All it takes for the chain of words to get derailed is for one unlucky word to be drawn from someone’s distribution of next-word prediction. GPT-3 doesn’t have the ability to “undo” words it has written; it can’t even tell its future self what its past self had in mind when it “wrote” a word!