Copy-pasting the transformer vs LSTM graph for reference (the one with the bigger gap):
If you told me that AGI looks like that graph, where you replace “flounders at 100M parameters” with “flounders at the scale where people are currently doing AGI research,” then I don’t think that’s going to give you a hard takeoff.
If you said “actually people will be using methods that flounder at a compute budget of 1e25 flops, but people will be doing AGI research with 1e30 flops, and the speedup will be > 1 OOM” then I agree that will give you a hard takeoff, but that’s what I’m saying transformers aren’t a good example of. In general I think that things tend to get more efficient/smooth as fields scale up, rather than less efficient, even though the upside from innovations that improve scaling is larger.
If you said “actually people won’t even be doing AGI research with a large fraction of the world’s compute, so we’ll have a modest improvement that allows scaling followed by a super rapid scaleup” then it seems like that’s got to translate into a bet about compute budgets in the near-ish future. I agree that AI compute has been scaling up rapidly from a tiny base, but I don’t think that is likely to happen in the endgame (because most of the feasible scaleup will have already occurred).
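To make the compute numbers in these scenarios concrete, here is a minimal toy sketch. The power-law form, the plateau behavior, and all the constants are made up for illustration; the 1e25 and 1e30 FLOPs figures are just the hypothetical budgets from the scenario above, not forecasts.

```python
# Toy sketch, illustrative numbers only: two hypothetical architectures,
# one whose loss keeps following a power law in compute, and one that
# plateaus ("flounders") at some compute scale. The question in the
# scenarios above is where that plateau sits relative to the compute
# actually being spent on frontier research.

import numpy as np

def powerlaw_loss(compute, scale=1e3, exponent=0.05):
    """Loss that keeps improving as a power law in compute (assumed toy form)."""
    return scale * compute ** -exponent

def plateau_loss(compute, plateau_at=1e25, scale=1e3, exponent=0.05):
    """Same toy power law, but it stops improving past `plateau_at` FLOPs."""
    return powerlaw_loss(np.minimum(compute, plateau_at), scale, exponent)

research_budget = 1e30   # hypothetical frontier research budget, FLOPs
flounder_scale = 1e25    # hypothetical scale where the old method plateaus

gap_ooms = np.log10(research_budget / flounder_scale)
print(f"Gap between plateau and research budget: {gap_ooms:.0f} OOM")
print(f"Old method's loss at the research budget: {plateau_loss(research_budget):.2f}")
print(f"New method's loss at the research budget: {powerlaw_loss(research_budget):.2f}")
```

The first scenario corresponds to `flounder_scale` being roughly equal to `research_budget` (no big jump available); the second corresponds to a multi-OOM gap between them.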
If you said “actually people will be using methods that flounder at a compute budget of 1e25 flops, but people will be doing AGI research with 1e30 flops, and the speedup will be > 1 OOM” then I agree that will give you a hard takeoff, but that’s what I’m saying transformers aren’t a good example of.
Why not? Here we have a pretty clean break: RNNs are not a tweak or two away from Transformers. We have one large, important family of algorithms which we can empirically demonstrate does not usefully absorb the compute that another, later, discretely different family does, a family that is now responsible for increasingly more compute; and the longer that family of improvements was forgone, the more compute overhang there would’ve been to exploit.
In a world where Transformers did not exist, we would not be talking about GPRNN-3 as a followup to GPRNN-2, which itself followed up OA’s original & much-unloved GPT-1 RNN. What would happen is that OA would put $10m into GPRNN-3, observe that it didn’t go anywhere (hard to eyeball the curves, but I wonder if it’d work even as well as GPT-2 did?), and the status quo of <100m-parameter RNNs would just keep going. There would not be any Switch Transformer, any WuDao, any HyperClova, any Pangu-Alpha, any Pathways/LaMDA/MUM; FB’s scaleup program in audio & translation wouldn’t be going… (There probably wouldn’t be any MLP renaissance either, as everyone seems to get there by asking ‘how much of a Transformer do we need anyway? how much can I ablate away? hm, looks like “all of it” when I start with a modern foundation with normalized layers?’)

We know what would’ve happened without Transformers: nothing. We can observe the counterfactual by simply looking: no magic RNNs dropped out of the sky merely to ‘make line go straight brrr’. It would simply be yet another sigmoid ending and an exciting field turning into a ‘mature technology’: “well, we scaled up RNNs and they worked pretty well, but it’ll require new approaches or way more compute than we’ll have for decades to come, oh well, let’s dick around until then.” Such a plateau would be no surprise, any more than it ought to be surprising that in 2021 you or I are not flying around on hypersonic rocket-jet personal pod cars the way everyone in aerospace was forecasting in the 1950s by projecting out centuries of speed increases.
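The overhang claim in the comment above ("the longer that family of improvements was forgone, the more compute overhang there would've been to exploit") can be made concrete with a toy calculation. The absorption cap, baseline budget, and growth rate below are all made-up illustrative numbers, not estimates from this discussion.

```python
# Toy sketch of the compute-overhang intuition: if the incumbent model
# family stops usefully absorbing compute past some cap while available
# budgets keep growing, then the later a better-scaling family arrives,
# the bigger the one-time jump it can exploit. All numbers are made up.

import math

absorb_cap = 1e21            # hypothetical FLOPs the old family can usefully absorb
baseline_budget = 1e21       # hypothetical frontier training budget when budgets hit the cap
budget_growth_per_year = 10  # hypothetical: budgets grow ~1 OOM per year

for delay_years in [0, 2, 4]:
    budget_at_arrival = baseline_budget * budget_growth_per_year ** delay_years
    overhang_ooms = math.log10(budget_at_arrival / absorb_cap)
    print(f"better-scaling family arrives {delay_years}y later: "
          f"~{overhang_ooms:.0f} OOM of compute the old family couldn't use")
```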
The counterfactual depends on what other research people would have done and how successful it would have been. I don’t think you can observe it “by simply looking.”
That said, I’m not quite sure what counterfactual you are imagining. By the time transformers were developed, soft attention in combination with LSTMs was already popular. I assume that in your counterfactual soft attention didn’t ever catch on? Was it proposed in 2014 but languished in obscurity and no one picked it up? Or was sequence-to-sequence attention widely used, but no one ever considered self-attention? Or something else?
Depending on how you are defining the counterfactual, I may think that you are right about the consequences. But if you are talking about a counterfactual that I regard as implausible, then naturally it’s not as interesting to me as things that actually happen. That’s what I was looking for in the quoted part of the OP—and evaluating transformers in terms of their (large!) actual impact rather than an imagined hypothetical where they could lead to fast-takeoff-like consequences.
Want to +1 that a vaguer version of this was my own rough sense of RNNs vs. CNNs vs. Transformers.
I think transformers are a big deal, but I think this comment is a bad guess at the counterfactual and it reaffirms my desire to bet with you about either history or the future. One bet down, handful to go?
Is this something that you’ve changed your mind on recently, or have I just misunderstood your previous stance? I don’t know if it would be polite to throw old quotes off Discord at you, but my understanding is that you expected most model differences to vanish in the limit, and that convolutions and RNNs and whatnot might well have held up fine with only minor tweaks to remove scaling bottlenecks.
I bring this up because that stance I thought you had seems to agree with Paul, whereas now you seem to disagree with him.