I broadly agree with the sentiment of this post, that GPT-2 and BERT tell us new things about language. I don’t think this claim relies on the fact that they’re transformers, though—and I’m skeptical when you say that “the transformer architecture was a real representational advance”, and that “You need the right architecture”. In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don’t think of removing inductive biases as representational advances—or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we’re doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).
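To make the comparison concrete, here is a minimal PyTorch sketch (a toy example of my own, not anything from your post) contrasting the two layer types on the same input: the convolution hard-codes both locality and position-invariant weights, while the self-attention layer hard-codes neither and has to learn which positions interact.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128)  # toy input: (batch, seq_len, channels)

# Convolution: locality (kernel_size=3) and weight sharing are built in.
conv = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, padding=1)
y_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, seq_len)

# Self-attention: neither constraint; every position can attend to every other.
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
y_attn, _ = attn(x, x, x)

print(y_conv.shape, y_attn.shape)  # both torch.Size([1, 64, 128])
```

In this sense the attention layer is the less opinionated of the two: it can learn local interactions if the data calls for them, but nothing about it privileges nearby positions.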
Concretely, I’d predict with ~80% confidence that within 3 years, we’ll be able to achieve comparable performance to our current best language models without using transformers—say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?
> In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don’t think of removing inductive biases as representational advances—or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we’re doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).
I think it’s misleading to view “amount of inductive bias” as a one-dimensional scale, with the transformer somewhere “between” CNNs and MLPs. As I said in that post, the move from vanilla MLPs to CNNs involves the introduction of two kinds of constraints/biases at once—weight sharing between positions, and locality—and these are two very different things, not just two (perhaps differently sized) injections of “more bias” on our hypothetical 1D bias scale.
For example, locality without weight sharing is certainly conceivable (I can’t remember if I’ve seen it before), but I’d imagine it would do very poorly on text data, because it relaxes the CNN constraint that’s appropriate for text (weight sharing across positions) while keeping the one that’s inappropriate (strict locality). If you compare that to the transformer, you’ve got two different ways of relaxing the CNN biases, but one works better and one (I would imagine) works worse. This shows that a given architecture’s representational aptness for a given domain isn’t just a function of some 1D “amount of inductive bias” in conjunction with data/compute volume; the specific nature of the biases and the domain matter too.
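To make that concrete, here is a minimal PyTorch sketch of such a layer (my own toy construction; `LocallyConnected1d` is a name I made up, not a standard PyTorch module). Like a convolution it only looks at a small window around each position, but every position gets its own kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConnected1d(nn.Module):
    """Locality without weight sharing: a conv-style local window at each
    position, but with an independent kernel per position (toy sketch)."""

    def __init__(self, seq_len, channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # One kernel per position: (seq_len, out_channels, in_channels * kernel_size).
        # A Conv1d would have a single (out_channels, in_channels * kernel_size) kernel.
        self.weight = nn.Parameter(
            torch.randn(seq_len, channels, channels * kernel_size) * 0.02
        )

    def forward(self, x):  # x: (batch, channels, seq_len)
        # Extract the same local windows a padded conv would see.
        windows = F.unfold(
            x.unsqueeze(-1),
            kernel_size=(self.kernel_size, 1),
            padding=(self.kernel_size // 2, 0),
        )  # (batch, channels * kernel_size, seq_len)
        # Apply a *different* kernel at every position.
        return torch.einsum("lok,bkl->bol", self.weight, windows)

layer = LocallyConnected1d(seq_len=64, channels=16)
out = layer(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

Note where the parameters go: this layer has seq_len times as many weights as the equivalent Conv1d, and it cannot recognise the same pattern at two different positions, which is exactly why I’d expect it to do badly on text.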
As a sidenote, most pre-transformer SOTA architectures for text were RNNs, not CNNs. So, having argued above that “moving to a superset” shouldn’t be simplified to “reducing some 1D ‘bias’ variable,” I’d also say that “moving to a superset” isn’t what happened anyway.
> Concretely, I’d predict with ~80% confidence that within 3 years, we’ll be able to achieve comparable performance to our current best language models without using transformers—say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?
Disagree. Not that this seems deeply impossible or anything, but it’s exactly what people were trying to do for many years before the introduction of the transformer; a lot of work has already gone into this, and now there’s less incentive to do it.
On the general topic of transformer vs. CNN/LSTM, as well as the specific topic of my OP, I found the paper linked by steve2152 very interesting.