However, I do hope to make some justifiable case below for transformers being able to scale in the limit to an AGI-like model (i.e. which was an emphatic “no” from you)
I don’t feel emphatic about it. Well, I have a model in my head and within that model a transformer can’t scale to AGI, and I was describing that model here, but (1) I’m uncertain that that model is the right way to think about things, (2) even if it is, I don’t have high confidence that I’m properly situating transformers within that model†, (3) even if I am, there is a whole universe of ways to take a Transformer architecture and tweak / augment it—like hook it up to a random access memory or tree search or any other data structure or algorithm, or give it more recurrency, or who knows what else—and I haven’t thought through all those possibilities and would not be shocked if somewhere in that space was a way to fill in what I see as the gaps.
† The paper relating Hopfield networks to transformers came out shortly after I posted this, and seems relevant to evaluating my idea that transformer networks are imitating some aspects of probabilistic programming / PGM inference, but I’m not sure, I haven’t really digested it.
the transformer is being “guided” to do implicit meta-learning … I argue that such zero-shot learning on an unseen task T requires online-learning on that task, which is being described in the given context.
I’m confused about how you’re using the terms “online learning” and “meta-learning” here.
I generally understand “online learning” in the sense of this, where you’re editing the model weights during deployment by doing gradient descent steps for each new piece of labeled data you get. If you’re generating text with GPT-3, then there’s no labeled data to update on, and the weights are fixed, so it’s not online learning by definition. I guess you have something else in mind; can you explain?
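To make the definition I have in mind concrete, here is a minimal sketch of online learning in that sense. The one-weight linear model, the data stream, and the learning rate are all hypothetical choices made for illustration; nothing here comes from GPT-3 itself.

```python
# Hypothetical one-weight model y ≈ w * x, used only to illustrate the definition.

def predict(w, x):
    return w * x

def online_sgd_step(w, x, y, lr=0.1):
    """One gradient-descent step on the squared error 0.5 * (w*x - y)**2."""
    grad = (predict(w, x) - y) * x
    return w - lr * grad

# Labeled examples arriving during deployment; the underlying relation is y = 2x.
stream = [(1.0, 2.0), (2.0, 4.0), (1.5, 3.0), (0.5, 1.0)] * 20

# Online learning: the weight is edited after every labeled example.
w_online = 0.0
for x, y in stream:
    w_online = online_sgd_step(w_online, x, y)

# Fixed-weight deployment: predictions are made, but the weight never changes.
w_frozen = 0.0
for x, _ in stream:
    predict(w_frozen, x)
```

Under this definition, text generation with fixed weights corresponds to the second loop, not the first.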
I generally understand “meta-learning” to mean that there’s an inner loop that has a learning algorithm, and then there’s an outer loop with a learning algorithm that impacts the inner loop. I guess you could say that the 96 transformer layers involved in each word-inference is the inner loop. Is it really a learning algorithm though? It doesn’t look like a learning algorithm. I mean, it certainly figures things out over the course of those 96 processing steps, and maybe it “figures something out” in step 20 and still “knows” that thing in step 84. So OK fine, I guess you can claim that there’s “learning” happening within a single GPT-3 word-inference task, even if it’s not the kind of learning that people normally talk about. And then it would be fair to say that the gradient descent training of GPT-3 is meta-learning. Is that what you have in mind? If so, can’t you equally well say that a 96-layer ConvNet training involves “meta-learning”? Sorry if I’m misunderstanding.
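As a rough sketch of the inner-loop / outer-loop picture: below, the inner loop is a few gradient steps on one task, and the outer loop adjusts the shared starting point that the inner loop begins from. The outer update is a Reptile-style rule, swapped in purely for simplicity; the tasks (lines with slopes 1, 2, 3) and all hyperparameters are assumptions for illustration.

```python
def inner_loop(w, slope, lr=0.1):
    """Inner learner: a few SGD steps fitting y = slope * x on one task."""
    for x in (1.0, 2.0, 1.0):
        grad = (w * x - slope * x) * x
        w = w - lr * grad
    return w

def outer_loop(task_slopes, w0=0.0, meta_lr=0.5, rounds=50):
    """Outer learner: nudge the shared initialization w0 toward each
    task's adapted weight (a Reptile-style update)."""
    for _ in range(rounds):
        for slope in task_slopes:
            w_task = inner_loop(w0, slope)
            w0 = w0 + meta_lr * (w_task - w0)
    return w0

def mean_error_after_adaptation(w0, task_slopes):
    """Average distance from the true slope after running the inner loop."""
    errors = [abs(inner_loop(w0, s) - s) for s in task_slopes]
    return sum(errors) / len(errors)

tasks = [1.0, 2.0, 3.0]
w0_meta = outer_loop(tasks)
```

Here the outer loop changes what the inner loop starts from, which is the sense in which it "impacts the inner loop". Whether a stack of transformer layers counts as an inner learning algorithm in this sense is exactly the open question.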
Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate “dynamic” weights that are then applied to any activations.
I was confused for a while because “weight” already has a definition: “weights” are the things that you find by gradient descent (a.k.a. “weights & biases” a.k.a. “parameters”). I guess what you’re saying is that I was putting a lot of emphasis on the idea that the algorithm is something like:
Calculate (28th entry of the input) × 8.21 + (12th entry of the input) × 2.47 + …, then do a nonlinear operation, then do lots more calculations like that, etc.
whereas in a transformer there are also things in the calculation that look like “(14th entry of the input) × (92nd entry of the input)”, i.e. multiplying functions of the input by other functions of the input. Is that what you’re saying?
If so, no I was using the term “matrix multiplications and ReLUs” in a more general way that didn’t exclude the possibility of having functions-of-the-input be part of each of two matrices that then get multiplied together. I hadn’t thought about that as being a distinguishing feature of transformers until now. I suppose that does seem important for widening the variety of calculations that are possible, but I’m not quite following your argument that this is related to meta-learning. Maybe this is related to the previous section, because you’re (re)defining “weights” as sorta “the entries of the matrices that you’re multiplying by”, and also thinking of “weights” as “when weights are modified, that’s learning”, and therefore transformers are “learning” within a single word-inference task in a way that ConvNets aren’t. Is that right? If so, I dunno, it seems like the argument is leaning too much on mixing up those two definitions of “weights”, and you need some extra argument that “the entries of the matrices that you’re multiplying by” have that special status such that if they’re functions of the inputs then it’s “learning” and if they aren’t then it isn’t. Again, sorry if I’m misunderstanding.
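The contrast between the two kinds of calculation can be sketched as follows. The input vector, the fixed weights, and the choice of which entries to multiply are arbitrary, hypothetical values.

```python
def relu(v):
    return max(0.0, v)

def fixed_weight_unit(x, w):
    """Input entries times fixed, learned numbers, summed, then a ReLU."""
    return relu(sum(wi * xi for wi, xi in zip(w, x)))

def input_product_unit(x, i, j):
    """One input entry times another: the multiplier on x[j] is x[i] itself."""
    return x[i] * x[j]

x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]          # arbitrary illustrative parameters
doubled = [2.0 * xi for xi in x]

fixed_out = fixed_weight_unit(x, w)                   # 0.5 - 2.0 + 6.0 = 4.5
fixed_out_doubled = fixed_weight_unit(doubled, w)     # scales by 2 here
product_out = input_product_unit(x, 0, 2)             # 1.0 * 3.0 = 3.0
product_out_doubled = input_product_unit(doubled, 0, 2)  # scales by 4
```

Doubling the input doubles the first unit's output (while the ReLU stays active) but quadruples the second, which is one simple way to see that the second kind of term is not expressible as a single fixed-weight sum.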
Standard feedforward DNNs encompass “circuits buildable from matmul and ReLU”; crucially, however, the backprop gradient update necessarily includes another key operator: matrix transpose.
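A small sketch of where the transpose shows up, assuming a single linear layer y = W x and a hypothetical loss L = 0.5 * ||y||^2 chosen only to keep the arithmetic short: the forward pass uses W, while the gradient with respect to the input uses W transposed.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def loss(W, x):
    y = matvec(W, x)
    return 0.5 * sum(yi * yi for yi in y)

def grad_wrt_input(W, x):
    y = matvec(W, x)                  # forward pass: y = W x
    return matvec(transpose(W), y)    # backward pass: dL/dx = W^T y

W = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]                # arbitrary illustrative weights
x = [0.5, -1.0, 2.0]

g = grad_wrt_input(W, x)              # y = [-1.5, 7.0], so g = [-1.5, -10.0, 21.0]
```

The same holds layer by layer in a deeper network: each backward step multiplies by the transpose of that layer's weight matrix.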
Transformers are a semi-special case of attention/memory augmented networks, which encompass “circuits buildable from matmul, ReLU, and transpose”, and thus they incorporate dynamic multiplicative interactions which bring (at least) the ability to learn (or quickly memorize) into the forward pass.
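As an illustration of "quickly memorize in the forward pass", here is a minimal single-head attention sketch used as a key-value lookup. It uses the standard softmax normalization and omits the usual 1/sqrt(d) scaling and the learned projection matrices for brevity; the keys, values, and query are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    scores = matmul(Q, transpose(K))          # the transpose enters the forward pass
    weights = [softmax(r) for r in scores]    # mixing weights computed from the input
    return matmul(weights, V), weights

# Two key/value pairs "written" into the context, then one query "reads" the second.
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0], [5.0]]
Q = [[0.0, 1.0]]

out, weights = attention(Q, K, V)             # out[0][0] is very close to 5.0
```

The mixing weights here are not parameters found by gradient descent; they are recomputed from whatever keys and queries appear in the context, so a new key/value pair can be used immediately without any weight update.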
So yes, adding that transpose into the forward/inference pass greatly expands the space of circuits you can efficiently emulate. It’s not obvious how many more such fundamental ops one needs for AGI. Brain circuits don’t obviously have many more key functional components beyond matmul, ReLU, multiplicative gating interactions, and efficient sparsity. (Brains also feature many other oddities, like multiplicative/exponential updates vs. linear ones and various related non-negative constraints, but it’s unclear how important those are.)