This comment does not deserve to be downvoted; I think it’s basically correct. GPT-2 is super-interesting as something that pushes the bounds of ML, but it is not replicating what goes on under the hood in human language production, as Marcus and Pinker were getting at. Writing styles don’t seem to reveal anything deep about cognition to me; it’s a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.
Writing styles don’t seem to reveal anything deep about cognition to me; it’s a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.
But isn’t it interesting that models built on how linguists thought word/punctuation choice worked in humans failed to produce human-like speech, and yet GPT-2 successfully produces it? Yes, obviously, it’s the babbler rather than the full brain. But that definitely lines up with my internal experience, where I have some ‘conceptual realm’ that hands concepts off to a babbler, which then generates sentences, much the way GPT-2 seems to operate (I can confidently start a sentence without knowing how I’ll finish it, and it’s sensible by the time I get there).
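As a minimal sketch of that “babbler” picture (my own illustration, not anything from the thread; it assumes the Hugging Face transformers and torch packages and the public "gpt2" checkpoint): each word is sampled from a distribution conditioned only on the text so far, so the model likewise “starts a sentence” without knowing how it will end.

```python
# Toy autoregressive loop: sample one token at a time, each conditioned
# only on the tokens produced so far. (Sketch only; assumes the Hugging
# Face transformers and torch packages and the public "gpt2" checkpoint.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("I can confidently start a sentence", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits[:, -1, :]               # scores for the next token only
        probs = torch.softmax(logits, dim=-1)              # distribution over the vocabulary
        next_id = torch.multinomial(probs, num_samples=1)  # sample; no plan for the whole sentence
        ids = torch.cat([ids, next_id], dim=1)
print(tokenizer.decode(ids[0]))
```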
That’s not a novel result, though. We’ve basically known those aspects of speech to be associative for decades. Indeed, it is pretty hard to explain many frequent errors in human speech without associative generative models. There are some outliers, like Chomsky, who persist in pushing unrealistic models of human speech, but for the most part the field has assumed that something like the Transformer model is how the lower levels of speech production work.
Now reducing that assumption to practice is a huge engineering accomplishment which I don’t mean to belittle. But the OP is wondering why linguists are not all infatuated with GPT-2. The answer is that there wasn’t that much to be learned from a theorist’s perspective. They already assigned >90% probability that GPT-2 models something like how speech production works. So having it reduced to practice isn’t that big of an update, in terms of Bayesian reasoning. It’s just the wheel of progress turning forward.
They already assigned >90% probability that GPT-2 models something like how speech production works.
Is that truly the case? I recall reading Corey Washington, a former linguist (who left the field for neuroscience in frustration with its culture and methods), claim that when he was a linguist the general attitude was that there was no way in hell something like GPT-2 would ever work even close to the degree that it does.
Found it:
Steve: Corey’s background is in philosophy of language and linguistics, and also neuroscience, and I have always felt that he’s a little bit more pessimistic than I am about AGI. So I’m curious — and answer honestly, Corey, no revisionist thinking — before the results of this GPT-2 paper were available to you, would you not have bet very strongly against the procedure that they went through working?
Corey: Yes, I would’ve said no way in hell actually, to be honest with you.
Steve: Yes. So it’s an event that caused you to update your priors.
Corey: Absolutely. Just to be honest, when I was coming up, I was at MIT in the mid ’80s in linguistics, and there was this general talk about how machine translation just would never happen, how it was just lunacy, and how maybe if they listened to us at MIT and took a little linguistics class they might actually figure out how to get this thing to work, but as it is they’re going off and doing this stuff which is just destined to fail. It’s a complete falsification of that basic outlook, which I think, looking back, of course, had a lot of hubris behind it but very little evidence.
I was just recently reading a paper in Dutch, and I just simply… First of all, the OCR recognized the Dutch language and it gave me a little text version of the page. I simply copied the page, pasted it into Google Translate, and got a translation that allowed me to basically read this article without much difficulty. That would’ve been thought to be impossible 20, 30 years ago — and it’s not even close to predicting the next word, or writing in the style that is typical of the corpus.
Do you know any promising theories of the higher levels of speech production (i.e., human verbal/symbolic reasoning)? That seems to me to be one of the biggest missing pieces at this point of a theoretical understanding of human intelligence (and of AGI theory), and I wonder if there’s actually good theoretical work out there that I’m just not aware of.
for the most part the field has assumed that something like the Transformer model is how the lower levels of speech production work
Can you be more specific about what you mean by “something like the Transformer model”? Or is there a reference you recommend? I don’t think anyone believes that there are literally neurons in the brain wired up into a Transformer, or anything like that, right?
As far as I’m aware, there was not (in recent decades at least) any controversy that word/punctuation choice was associative. We even have famous psycholinguistics experiments telling us that thinking of the word “goose” makes us more likely to think of the word “moose” as well as “duck” (linguistic priming is the one type of priming that has held up to the replication crisis as far as I know). Whenever linguists might have bothered to make computational models, I think those would have failed to produce human-like speech because their associative models were not powerful enough.
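As a loose analogy (my own illustration, not a claim about the psycholinguistic experiments themselves), you can see a similar associative effect in a language model: putting a prime like “goose” in the context shifts probability toward a related word like “duck”. The sketch below assumes the Hugging Face transformers and torch packages and the public "gpt2" checkpoint; the prompts are made up for illustration.

```python
# Compare the probability of "duck" as the next word with and without a
# "goose" prime in the context. (Toy illustration only.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_prob(context: str, word: str) -> float:
    """Probability the model assigns to `word` immediately after `context`."""
    ctx = tokenizer.encode(context, return_tensors="pt")
    word_id = tokenizer.encode(" " + word)[0]  # leading space marks a word boundary in GPT-2's BPE
    with torch.no_grad():
        logits = model(ctx).logits[0, -1]
    return torch.softmax(logits, dim=-1)[word_id].item()

print(next_word_prob("He photographed a goose and then a", "duck"))  # primed context
print(next_word_prob("He photographed a house and then a", "duck"))  # neutral context
```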
The appearance of a disagreement in this thread seems to hinge on an ambiguity in the phrase “word choice.”
If “word choice” just means something narrow like “selecting which noun you want to use, given that you are picking the inhabitant of a ‘slot’ in a noun phrase within a structured sentence and have a rough idea of what concept you want to convey,” then perhaps priming and other results about perceptions of “word similarity” might tell us something about how it is done. But no one ever thought that kind of word choice could scale up to full linguistic fluency, since you need some other process to provide the syntactic context. The idea that syntax can be eliminatively reduced to similarity-based choices on the word level is a radical rejection of linguistic orthodoxy. Nor does anyone (as far as I’m aware) believe GPT-2 works like this.
If “word choice” means something bigger that encompasses syntax, then priming experiments about single words don’t tell us much about it.
I do take the point that style as such might be a matter of the first, narrow kind of word choice, in which case GPT-2’s stylistic fluency is less surprising than its syntactic fluency. In fact, I think that’s true—intellectually, I am more impressed by the syntax than the style.
But the conjunction of the two impresses me to an extent greater than the sum of its parts. Occam’s Razor would have us prefer one mechanism to two when we can get away with it, so if we used to think two phenomena required very different mechanisms, a model that gets both using one mechanism should make us sit up and pay attention.
It’s more a priori plausible that all the distinctive things about language are products of a small number of facts about brain architecture, perhaps adapted to do only some of them with the rest arising as spandrels/epiphenomena—as opposed to needing N architectural facts to explain N distinctive things, with none of them yielding predictive fruit beyond the one thing it was proposed to explain. So, even if we already had a (sketch of a) model of style that felt conceptually akin to a neural net, the fact that we can get good style “for free” out of a model that also does good syntax (or, if you prefer, good syntax “for free” out of a model that also does good style) suggests we might be scientifically on the right track.
Neither one is surprising to me at all. In fact I don’t think there is a sharp divide between syntax and style—syntax is that word which we assign to culturally shared style. That’s why we can define specialized syntaxes for dialectal differences. And as a structural rule, syntax/style is very relevant to word choice since it prohibits certain combinations. A big enough network will have a large enough working memory to “keep in mind” enough contextual information to effectively satisfy the syntax rules describing the styles it learned.
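To make that last claim concrete, here is a rough, self-contained check (my own toy example, again assuming the Hugging Face transformers and torch packages and the public "gpt2" checkpoint) of whether the model’s context is enough to enforce a simple agreement rule across an intervening clause:

```python
# Does the model prefer the verb form that agrees with a subject several
# words back in the context? (Toy check only; the prompt is made up.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The keys that the man by the cabinets was holding"
ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(ids).logits[0, -1], dim=-1)

for verb in ["are", "is"]:                    # the plural subject "keys" calls for "are"
    vid = tokenizer.encode(" " + verb)[0]     # leading space: new word in GPT-2's BPE
    print(f"P({verb!r} | prompt) = {probs[vid].item():.4f}")
```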