There’s an interesting tweet thread in which xuan disagrees with some of your predictions because of LLM limitations and the belief that the scaling hypothesis will not hold.
What’s your take on the response to your predictions, and how does it affect your predictions, if you believed the tweet thread?
https://twitter.com/xuanalogue/status/1666765447054647297?t=a60XmQsIEsfHpf2O7iGMCg&s=19
Additionally, I strongly dislike twitter, but since you claimed there were worthwhile arguments there, I gritted my teeth and dove in. To save others like me from having to experience the same frustration with broken-up bits of discourse intermingled with bad hot-takes and insults from randos, I copy-pasted the relevant information. I don’t think there’s much substance here, other than xuan noting that she believes the above article holds only if you assume we keep using transformer LLMs. If we instead switch to a superior successor to transformer LLMs, the predictions don’t hold, and the article should address that. (After the thread, I’ve also added a small toy sketch of xuan’s “shortcut vs. true algorithm” point, since I found it easier to see with a concrete example.)
Here’s the thread:
xuan (ɕɥɛn / sh-yen)
I respect Jacob a lot but I find it really difficult to engage with predictions of LLM capabilities that presume some version of the scaling hypothesis will continue to hold—it just seems highly implausible given everything we already know about the limits of transformers!
If someone can explain how the predictions above could still come true in light of the following findings, that’d honestly be helpful. - Transformers appear unable to learn non-finite or context-free languages, even autoregressively:
Dennis Ulmer
Very cool paper to start the week: “Neural Networks and the Chomsky Hierarchy”, showing which NLP architectures are able to generalize to which different formal languages! https://arxiv.org/abs/2207.02098
xuan
Transformers learn shortcuts (via linearized subgraph matching) to multi-step reasoning problems instead of the true algorithm that would systematically generalize: “Faith and Fate: Limits of Transformers on Compositionality”
Similarly, transformers learn shortcuts to recursive algorithms from input / output examples, instead of the recursive algorithm itself:
“Can Transformers Learn to Solve Problems Recursively?”
These are all limits that I don’t see how “just add data” or “just add compute” could solve. General algorithms can be observationally equivalent with ensembles of heuristics on arbitrarily large datasets as long as the NN has capacity to represent that ensemble.
So unless you restrict the capacity of the model or do intense process-based supervision (requiring enough of the desired algorithm to just program it directly in the first place), it seems exceedingly unlikely that transformers would learn generalizable solutions.
Some additional thoughts on what autoregressive transformers can express (some variants are Turing-complete), vs. what they can learn, in this thread!
Teortaxes
Replying to @xuanalogue
Consider that all those recent proofs of hard limits do not engage with how Transformers are used in practice, and they are qualitatively more expressive when utilized with even rudimentary scaffolding. https://twitter.com/bohang_zhang/s
I should also add that I don’t find all the predictions implausible in the original piece—inference time will definitely go down, model copying and parallelization is already happening, as is multimodal training. I just don’t buy the superhuman capabilities.
(Also not convinced that multimodal training buys that much—more tasks will become automatable, but I don’t think there’s reason to expect synergistic increase in capabilities. And PaLM-E was pretty underwhelming...)
Asa Cooper Stickland
LLMs don’t have to mean transformers though. Seems like there is a lot of research effort going into finding better architectures, and 7 years is a decent chunk of time to find them
xuan (ɕɥɛn / sh-yen)
Yup—I think predictions should make that clear though! Then it’s not based on just the scaling hypothesis but also on algorithmic advances—which the post doesn’t base its predictions upon, as far as I can tell.
Alyssa Vance
This was what I thought for several years, but GPT-4 is a huge data point against, no? It seems to just keep getting better
xuan (ɕɥɛn / sh-yen)
My view on the capabilities increase is fairly close to this one! [ed: see referenced tweet from Talia below]
Talia Ringer
I also am extra confused why people are freaking out about the AI Doom nonsense now of all times because it seems to come in the wake of GPT-4 and ChatGPT, but the really big jump in capabilities came in the GPT-2 to GPT-3 jump, and recent improvements have been very modest. So like, I think GPT-4 kind of substantiates the view that we are heading nowhere too interesting very quickly. I think we are witnessing mass superstition interacting with cognitive biases and the wish for a romantic reality in which one can be the hero, though!
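To make xuan’s “shortcut vs. true algorithm” point concrete before I give my own take, here is a toy sketch of my own (it is not from the thread or from the cited papers, and no transformer is involved; the hand-written shortcut just stands in for whatever heuristic a network might latch onto). The point it illustrates: a shortcut can agree with the real membership rule for the context-free language a^n b^n on essentially all of a naive training distribution, yet disagree on essentially every near-miss probe, which is the sense in which “just add data” doesn’t force the general algorithm to be learned.

```python
# Toy illustration (mine, not from the cited papers): a shortcut heuristic can be
# observationally near-equivalent to the true a^n b^n rule on a naive training
# distribution, yet fail on near-miss probes that require actually counting.
import random

def true_rule(s: str) -> bool:
    """Ground truth: membership in the context-free language a^n b^n (n >= 1)."""
    n = len(s) // 2
    return len(s) > 0 and len(s) % 2 == 0 and s == "a" * n + "b" * n

def shortcut(s: str) -> bool:
    """Stand-in for a learned shortcut: accept any block of a's followed by a
    block of b's, without ever checking that the two counts are equal."""
    return len(s) > 0 and s[0] == "a" and s[-1] == "b" and "ba" not in s

def training_style_examples(k: int, n_max: int = 10) -> list[str]:
    """Half positives a^n b^n with small n, half uniformly random strings as the
    kind of easy negatives a naive dataset would contain."""
    out = []
    for _ in range(k):
        if random.random() < 0.5:
            n = random.randint(1, n_max)
            out.append("a" * n + "b" * n)
        else:
            length = random.randint(2, 2 * n_max)
            out.append("".join(random.choice("ab") for _ in range(length)))
    return out

def near_miss_probes(k: int, n_max: int = 50) -> list[str]:
    """Out-of-distribution probes: a^n b^m with m close to, but never equal to, n."""
    out = []
    for _ in range(k):
        n = random.randint(3, n_max)
        m = n + random.choice([-2, -1, 1, 2])
        out.append("a" * n + "b" * m)
    return out

def agreement(strings: list[str]) -> float:
    """Fraction of strings on which the shortcut agrees with the true rule."""
    return sum(true_rule(s) == shortcut(s) for s in strings) / len(strings)

print("agreement on training-style data:", round(agreement(training_style_examples(2000)), 3))
print("agreement on near-miss probes:   ", round(agreement(near_miss_probes(2000)), 3))
```

Running this prints near-perfect agreement on the training-style data and near-zero agreement on the probes. The papers xuan cites make the analogous argument with actual trained transformers; the sketch is only meant to show why more of the same training data doesn’t, by itself, rule the shortcut out.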
I can give you my take: it would be foolish to think that GPT-2030 will be an LLM as we know it, with the primary change being more params/compute/data. There have already been algorithmic improvements with each GPT version, and those will be the primary driver of capability advances going forward. We know, from observing the details of brains via neuroscience, that there are algorithmic advances still to be made that would yield huge leaps of ability in specific strategically relevant domains, such as long-horizon planning and strategic reasoning. To project that GPT-2030 won’t have superhuman capabilities, you must explicitly state that you believe the ML research community will fail to replicate these algorithmic capabilities of the human brain, and then you must justify that assertion.