‘Stochastic parrots’ 2020 actually does make many falsifiable claims. Like the original stochastic parrots paper even included a number of samples of specific prompts that they claimed LLMs could never do. Likewise, their ‘superintelligent octopus’ example of eavesdropping on (chess, IIRC) game transcripts is the claim that imitation or offline RL for chess is impossible. Lack of falsifiable claims was not the problem with the claims made by eg. Gary Marcus.
The problem is that those claims have generally all been falsified, quite rapidly: the original prompts were entirely soluble by LLMs back in 2020, and it is difficult to accept the octopus claims in the light of results like https://arxiv.org/abs/2402.04494#deepmind . (Which is probably why you no longer hear much about the specific falsifiable claims made by the stochastic parrots paper, even by people still citing it favorably.) But then the goalposts moved.
‘Stochastic parrots’ 2020 actually does make many falsifiable claims. [...] The problem is that those claims have generally all been falsified, quite rapidly.
The paper seems quite wrong to me, but I actually don’t think any of the specific claims have been falsified other than the specific “three plus five” claim in the appendix.
The specific claims:
An octopus trained on just “trivial notes” wouldn’t be able to generalize to thoughts on coconut catapults. Doesn’t seem clear that this has been falsified depending on how you define “trivial notes” (which is key). (Let’s suppose these notes don’t involve any device construction?) Separately, it’s not as though human children would generalize...
The same octopus, but asked about defending from bears. I claim the same is true as with the prior example.
If you train an LLM on just Java code, but with all references to input/output behavior stripped out, it won’t generalize to predicting outputs. (Seems likely true to me, but uninteresting?)
If you train a model on text and images separately, it won’t generalize to answering text questions about images. (Seems clearly true to me, but also uninteresting? More interesting would be: you train on text and images, then just train to answer questions about dogs and see if it generalizes to cats. I think this could work with current models, and is likely to work if you expand the question training set to be more general (but still exclude cats). E.g., GPT-4 clearly can generalize to identifying novel objects which are described.)
An LLM will never be able to answer “three plus five equals”. Clearly falsified, and obviously so. Likely they intended additional caveats about the training data??? (Otherwise memorization clearly works...)
For each of these specific cases, it seems pretty silly because clearly you can just train your LLM on a wide variety of stuff. (Similar to humans.) Also, I think you can train humans purely on text and do perfectly fine… (Though I’m not aware of clear experiments here, because even blind and deaf people have touch. You’d want to take a blind+deaf person who acquires semantics only via braille.)
I think you can do experiments which very compellingly argue against this paper, but I don’t really see specific claims being falsified.
An octopus trained on just “trivial notes” wouldn’t be able to generalize to thoughts on coconut catapults.
I don’t believe they say “just”. They describe the two humans as talking about lots of things, including but not limited to daily gossip: https://aclanthology.org/2020.acl-main.463.pdf#page=4 The ‘trivial notes’ part is simply acknowledging that in very densely-sampled ‘simple’ areas of text (like the sort of trivial notes one might pass back and forth in SMS chat), the superintelligent octopus may well succeed in producing totally convincing text samples. But if you continue on to the next page, you see that they continue giving hostages to fortune—for example, their claims about ‘rope’/‘coconut’/‘nail’ are falsified by the entire research area of vision-language models like Flamingo, as well as by work reusing frozen LLMs for control, like SayCan. It turns out text-only LLMs already have plenty of visual grounding hidden in them, and their textual latent spaces already align to far above chance levels. So much for that.
The same octopus, but asked about defending from bears. I claim the same is true as with the prior example.
It’s not, because the bear example is again like the coconut catapult—the cast-away islanders are not being chased around by bears constantly and exchanging ‘trivial notes’ about how to deal with bear attacks! Their point is that this is the sort of causal model and novel utterance that a mere imitation of ‘form’ cannot grant any ‘understanding’ of. (As it happens, they are embarrassingly wrong here, because their bear example is not even wrong. They do not give what they think would be the ‘right’ answer, but whatever answer they gave, it would be wrong—because you are actually supposed to do the exact opposite things for the two major kinds of bears you would be attacked by in North America. Therefore, there is no answer to the question of how to use sticks when ‘a bear’ chases you. IIRC, if you check bear attack safety guidelines, the actual answer is that if one type attacks you, you should use the sticks to try to defend yourself and appear bigger; but if the other type attacks you, this is the worst thing you can possibly do and you need to instead play dead. And if you fix their question to specify the bear type so there is a correct answer, then the LLMs get it right.) You can gauge the robustness & non-falsification of their examples by noting that after I rebutted them back in 2020, they refused to respond, dropped those examples silently without explanation from their later papers, and started calling me a eugenicist.
If you train a model on text and images separately, it won’t generalize to answering questions about both images. (Seems clearly true to me
I assume you mean ‘won’t generalize to answering questions about both modalities’, and that’s false.
If you train an LLM on just Java code, but with all references to input/output behavior stripped out, it won’t generalize to predicting outputs. (Seems likely true to me, but uninteresting?)
I don’t know if there’s anything on this exact scenario, but I wouldn’t be surprised if it could ‘generalize’. Although you would need to nail this down a lot more precisely to avoid them wriggling out of it: does this include stripping out all comments, which will often include input/output examples? Is pretraining on natural language text forbidden? What exactly is an ‘LLM’, and does this rule out all offline RL or model-based RL approaches which try to simulate environments? etc.
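To make the scenario concrete, here is roughly the kind of evaluation the claim is about (a minimal sketch: the prompt format, the stripped-comments training setup, and the expected answer are illustrative assumptions, not anything specified in the paper or the comments above):

```python
# Hypothetical illustration (not from the paper): the model is trained only on
# Java source with comments stripped, so it never sees an explicit statement of
# what any program prints. The evaluation asks it to predict a program's output.
EVAL_PROMPT = """\
public class Demo {
    public static void main(String[] args) {
        int x = 3;
        int y = 5;
        System.out.println(x * y + 1);
    }
}
// The program above prints:
"""

EXPECTED_COMPLETION = "16"  # 3 * 5 + 1 = 16; answering requires simulating execution

if __name__ == "__main__":
    print(EVAL_PROMPT + EXPECTED_COMPLETION)
```

Whether a model trained under those restrictions completes such prompts correctly is exactly the kind of thing that would need to be pinned down before calling the claim falsified or not.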
I assume you mean ‘won’t generalize to answering questions about both modalities’, and that’s false.
Oops, my wording was confusing. I was imagining something like having a transformer which can take in both text tokens and image tokens (patches), but each training sequence is either only images or only text. (Let’s also suppose we strip text out of images for simplicity.)
Then we test whether it generalizes to a context which has both images and text, by asking the model “How many dogs are in the image?”
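A minimal sketch of that setup, to pin down what training ‘separately’ means here (the token-ID ranges, VQ-style image tokens, and function names are illustrative assumptions, not part of the original proposal):

```python
import random

# One transformer, one shared vocabulary, but disjoint ID ranges for text
# tokens and image-patch tokens (e.g. VQ codes). Numbers are illustrative.
TEXT_VOCAB = range(0, 50_000)        # text (BPE) token ids
IMAGE_VOCAB = range(50_000, 58_192)  # discretized image-patch token ids

def sample_training_sequence(text_corpus, image_corpus):
    """Training regime: every sequence is unimodal -- all text or all image,
    never both modalities in one context."""
    corpus = text_corpus if random.random() < 0.5 else image_corpus
    return random.choice(corpus)

def build_eval_context(image_tokens, question_tokens):
    """Evaluation regime: the first time both modalities appear in one context,
    e.g. [image patches] + the text question 'How many dogs are in the image?'"""
    return list(image_tokens) + list(question_tokens)
```

The test is then whether a model trained only on sequences from sample_training_sequence answers prompts built by build_eval_context above chance.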
‘Stochastic parrots’ 2020 actually does make many falsifiable claims. Like the original stochastic parrots paper even included a number of samples of specific prompts that they claimed LLMs could never do.
The Bender et al paper? “On the Dangers of Stochastic Parrots”? Other sources like Wikipedia cite that paper as the origin of the term.
I’ll confess I skipped parts of it (eg the section on environmental costs) when rereading it before posting the above, but that paper doesn’t contain ‘octopus’ or ‘game’ or ‘transcript’, and I’m not seeing claims about specific prompts.
Oh, no, I see, I think you’re referring to Bender and Koller, “Climbing Toward NLU”? I haven’t read that one, I’ll skim it now.
OK, yeah, Bender & Koller is much more bullet-biting, up to and including denying that any understanding happens anywhere in a Chinese Room. In particular they argue that completing “three plus five equals” is beyond the ability of any pure LM, which is pretty wince-inducing in retrospect.
I really appreciate that in that case they did make falsifiable claims; I wonder whether either author has at any point acknowledged that they were falsified. [Update: Bender seems to have clearly held the same positions as of September 23, based on the slides from this talk.]
I really appreciate that in that case they did make falsifiable claims; I wonder whether either author has at any point acknowledged that they were falsified
AFAICT, the only falsified claim in the paper is the “three plus five equals” claim you mentioned. This is in this appendix, and it’s not that clear to me what they mean by “pure LLM”. (Like surely they agree that you can memorize this?)
The other claims are relatively weak and not falsified. See here.