‘Stochastic parrots’ 2020 actually does make many falsifiable claims. [...] The problem is that those claims have generally all been falsified, quite rapidly.
The paper seems quite wrong to me, but I actually don’t think any of the specific claims have been falsified other than the specific “three plus five” claim in the appendix.
The specific claims:
An octopus trained on just “trivial notes” wouldn’t be able to generalize to thoughts on coconut catapults. It doesn’t seem clear that this has been falsified; it depends on how you define “trivial notes” (which is key). (Let’s suppose these notes don’t involve any device construction?) Separately, it’s not as though human children would generalize...
The same octopus, but asked about defending from bears. I claim the same is true as with the prior example.
If you train an LLM on just Java code, but with all references to input/output behavior stripped out, it won’t generalize to predicting outputs. (Seems likely true to me, but uninteresting?)
If you train a model on text and images separately, it won’t generalize to answering text questions about images. (Seems clearly true to me, but also uninteresting? More interesting would be to train on text and images, but then train question-answering only on dogs and see if it generalizes to cats. I think this could work with current models, and is likely to work if you expand the question training set to be more general (but still exclude cats). E.g., GPT-4 clearly can generalize to identifying novel objects which are described. See the sketch of this held-out-category split at the end of this comment.)
An LLM will never be able to answer “three plus five equals”. Clearly falsified, and obviously so. Likely they intended additional caveats about the training data??? (Otherwise memorization clearly works...)
For each of these specific cases, it seems pretty silly because clearly you can just train your LLM on a wide variety of stuff. (Similar to humans.) Also, I think you could train humans purely on text and they’d do perfectly fine… (Though I’m not aware of clear experiments here, because even blind and deaf people have touch. You’d want to take a blind and deaf person who then acquires semantics only via braille.)
I think you can do experiments which very compellingly argue against this paper, but I don’t really see specific claims being falsified.
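To make the dogs-vs.-cats experiment above concrete, here is a minimal sketch of the held-out-category split I have in mind. Everything in it is hypothetical (the dataset fields, the fine-tuning and evaluation helpers); it just shows the shape of the test: the model never sees questions about the held-out category during QA fine-tuning, and we check whether it answers them anyway.

```python
# Hypothetical sketch: hold out one object category from QA fine-tuning,
# then test whether the model generalizes to questions about that category.

def split_by_held_out_category(examples, held_out="cat"):
    """examples: list of dicts like {"image": ..., "question": str, "answer": str}.
    Any example whose question or answer mentions the held-out category goes to test."""
    train, test = [], []
    for ex in examples:
        text = (ex["question"] + " " + ex["answer"]).lower()
        (test if held_out in text else train).append(ex)
    return train, test

# Usage, with hypothetical dataset / fine-tuning / evaluation helpers:
# train_set, cat_questions = split_by_held_out_category(vqa_examples, held_out="cat")
# model = finetune_on_qa(pretrained_multimodal_model, train_set)  # never sees cat questions
# accuracy = evaluate_qa(model, cat_questions)                    # does it generalize?
```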
An octopus trained on just “trivial notes” wouldn’t be able to generalize to thoughts on coconut catapults.
I don’t believe they say “just”. They describe the two humans as talking about lots of things, including but not limited to daily gossip: https://aclanthology.org/2020.acl-main.463.pdf#page=4 The ‘trivial notes’ part is simply acknowledging that in very densely-sampled ‘simple’ areas of text (like the sort of trivial notes one might pass back and forth in SMS chat), the superintelligent octopus may well succeed in producing totally convincing text samples. But if you continue on to the next page, you see that they continue giving hostages to fortune. For example, their claims about ‘rope’/‘coconut’/‘nail’ are falsified by the entire research area of vision-language models like Flamingo, as well as by reusing frozen LLMs for control, as in SayCan. It turns out text-only LLMs already have plenty of visual grounding hidden in them, and their textual latent spaces already align with visual ones to far above chance levels. So much for that.
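(The “far above chance” alignment claim is also the sort of thing you can check directly with a simple linear probe. Below is a minimal sketch of that kind of check, not any particular paper’s method: it assumes you already have caption embeddings from a text-only LM and image embeddings from a separately trained vision encoder for the same pairs, fits a linear map between the two spaces on half the pairs, and measures top-1 retrieval on the held-out half against the 1/N chance baseline. The arrays here are random placeholders.)

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder embeddings -- in a real test these would come from a text-only LM
# (caption embeddings) and a separately trained image encoder, for the same N pairs.
rng = np.random.default_rng(0)
N, d_img, d_txt = 1000, 512, 768
img_emb = rng.normal(size=(N, d_img))
txt_emb = rng.normal(size=(N, d_txt))

# Fit a linear map image-space -> text-space on the first half of the pairs.
train, test = slice(0, N // 2), slice(N // 2, N)
probe = Ridge(alpha=1.0).fit(img_emb[train], txt_emb[train])

# On held-out pairs, does the mapped image embedding retrieve its own caption?
pred = probe.predict(img_emb[test])
pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
targ_n = txt_emb[test] / np.linalg.norm(txt_emb[test], axis=1, keepdims=True)
sims = pred_n @ targ_n.T                                  # cosine similarities
top1 = (sims.argmax(axis=1) == np.arange(len(pred))).mean()
print(f"top-1 retrieval accuracy: {top1:.3f} (chance = {1 / len(pred):.3f})")
# With real embeddings, "far above chance" means top1 >> 1/N; with these random
# placeholders it will sit at roughly the chance level.
```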
The same octopus, but asked about defending from bears. I claim the same is true as with the prior example.
It’s not the same, because the bear example is again like the coconut catapult: the cast-away islanders are not being chased around by bears constantly and exchanging ‘trivial notes’ about how to deal with bear attacks! Their point is that this is the sort of causal model and novel utterance that a mere imitation of ‘form’ cannot grant any ‘understanding’ of. (As it happens, they are embarrassingly wrong here, because their bear example is not even wrong. They do not give what they think would be the ‘right’ answer, but whatever answer they gave, it would be wrong, because you are actually supposed to do the exact opposite things for the two major kinds of bears that might attack you in North America. Therefore, there is no answer to the question of how to use sticks when ‘a bear’ chases you. IIRC, if you check bear attack safety guidelines, the actual answer is that if one type attacks you, you should use the sticks to try to defend yourself and appear bigger; but if the other type attacks you, this is the worst thing you can possibly do and you need to instead play dead. And if you fix their question to specify the bear type so there is a correct answer, then the LLMs get it right.) You can gauge the robustness & non-falsification of their examples by noting that after I rebutted them back in 2020, they refused to respond, dropped those examples silently without explanation from their later papers, and started calling me a eugenicist.
If you train a model on text and images separately, it won’t generalize to answering questions about both images. (Seems clearly true to me
I assume you mean ‘won’t generalize to answering questions about both modalities’, and that’s false.
If you train an LLM on just Java code, but with all references to input/output behavior stripped out, it won’t generalize to predicting outputs. (Seems likely true to me, but uninteresting?)
I don’t know if there’s anything on this exact scenario, but I wouldn’t be surprised if it could ‘generalize’. Although you would need to nail this down a lot more precisely to avoid them wriggling out of it: does this include stripping out all comments, which will often include input/output examples? Is pretraining on natural-language text forbidden? What exactly is an ‘LLM’, and does this rule out all offline RL or model-based RL approaches which try to simulate environments? etc.
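To illustrate how much of the ambiguity lives in the preprocessing alone, here is a minimal, purely hypothetical sketch (not anyone’s actual pipeline) of what ‘stripping all references to input/output behavior’ from a Java training corpus might mean: removing comments (which often embed I/O examples) plus the obvious console-I/O calls. Even this toy version forces you to decide how aggressive to be.

```python
import re

# Hypothetical preprocessing for a 'Java-only, no I/O' training corpus:
# strip comments (which often embed input/output examples) and drop lines
# that perform console I/O. Deciding where to stop is exactly the ambiguity.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
IO_CALL_RE = re.compile(r"\bSystem\s*\.\s*(out|err|in)\b|\bScanner\b")

def strip_io_references(java_source: str) -> str:
    no_comments = COMMENT_RE.sub("", java_source)
    kept = [ln for ln in no_comments.splitlines() if not IO_CALL_RE.search(ln)]
    return "\n".join(kept)

example = '''
// prints 8 for input "3 5"
static int add(int a, int b) { return a + b; }
public static void main(String[] args) {
    System.out.println(add(3, 5));  /* output: 8 */
}
'''
print(strip_io_references(example))
# The arithmetic survives; the comments and the println showing its output do not.
```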
I assume you mean ‘won’t generalize to answering questions about both modalities’, and that’s false.
Oops, my wording was confusing. I was imagining something like having a transformer which can take in both text tokens and image tokens (patches), but each training sequence is either only images or only text. (Let’s also suppose we strip text out of images for simplicity.)
Then, we generalize to a context which has both images and text and ask the model “How many dogs are in the image?”
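For concreteness, here is a minimal sketch of that setup in PyTorch. All the names and sizes are made up, positional encodings and causal masking are omitted, and the training loop is elided; the point is just the architecture: one transformer over a shared token stream, with text entering via an embedding table and image patches via a linear projection into the same model width. Training batches are each purely one modality; the test of the claim is whether a mixed context (image patches followed by the question’s text tokens) produces anything sensible.

```python
import torch
import torch.nn as nn

class TinyMixedModalLM(nn.Module):
    """Hypothetical sketch: one transformer over a shared sequence, where text
    enters via an embedding table and image patches via a linear projection."""
    def __init__(self, vocab_size=32000, patch_dim=768, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids=None, image_patches=None):
        parts = []
        if image_patches is not None:          # (B, n_patches, patch_dim)
            parts.append(self.patch_proj(image_patches))
        if text_ids is not None:               # (B, n_tokens)
            parts.append(self.text_embed(text_ids))
        x = torch.cat(parts, dim=1)            # shared sequence of both modalities
        return self.lm_head(self.backbone(x))  # logits over the text vocabulary

model = TinyMixedModalLM()
# Training (elided): every batch is EITHER text-only OR image-only sequences.
# The test: a mixed context the model never saw during training.
patches = torch.randn(1, 16, 768)              # placeholder image patches
question = torch.randint(0, 32000, (1, 10))    # stand-in for "How many dogs are in the image?"
logits = model(text_ids=question, image_patches=patches)
print(logits.shape)                            # (1, 26, 32000): 16 patch + 10 text positions
```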