Edited for clarity and to correct misinterpretations of central arguments.
This response considers (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a NN you may have in mind, namely a series of matrix multiplications by “fixed” weight matrices. This is the assumption I will first try to undermine. In doing so, I hope to lay some groundwork for an explanatory framework for neural networks with self-attention layers (for much later), or (better) to inspire transparency efforts by others, since I’m mainly writing this to provoke further thought.
However, I do hope to make some justifiable case below for transformers being able to scale in the limit to an AGI-like model (i.e. which was an emphatic “no” from you), because they do seem to exhibit the type of behavior (i.e. few-shot learning, out-of-distribution generalization) that we would expect to scale to AGI, if improvements in these respects continue.
I see that you are already familiar with transformers, and I will reference this description of their architecture throughout.
Epistemic Status: What follows are currently incomplete, likely fatally flawed arguments that I may correct down the line.
Caveats: It’s reasonable to dismiss transformers/GPT-N as falling into the same general class as fully connected architectures, in the sense that:
They’re data hungry, like most DNNs, at least during the pre-training phase.
They’re not explicitly replicating neocortical algorithms (such as bottom-up feedback on model predictions) that we know are important for systematic generalization.
They have extraneous inductive biases besides those in the neocortex, which hinder efficiency.
Some closer approximation of the neocortex, such as hierarchical temporal memory, is necessary for efficiently scaling to AGI.
Looking closer: How Transformers Depart “Functionally” from Most DNNs
Across two weight matrices of a fully-connected DNN, we see something like:
σ(Ax)^T B^T, for some input vector x and weight matrices {A, B} of two hidden layers, where σ is an element-wise activation function; the result is just another vector of activations.
These activations are “dynamic”. But I think you would be right to say that they do not in any sense modify the weights applied to activations downstream; that is the kind of behavior you implied the neocortex has but the transformer lacks.
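To make the fixed-weight picture concrete, here is a minimal numpy sketch (written in the equivalent column-vector form B σ(Ax), the transpose of the expression above; σ is taken to be ReLU and the dimensions are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 32))   # first layer's weight matrix: fixed once training ends
B = rng.normal(size=(16, 64))   # second layer's weight matrix: also fixed

def relu(z):                    # stand-in for the element-wise activation σ
    return np.maximum(z, 0.0)

x = rng.normal(size=32)         # input vector
h = relu(A @ x)                 # the activations depend on x ("dynamic")...
y = B @ h                       # ...but A and B themselves never depend on x
```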
In a Transformer self-attention matrix (A=QK^T), though, we see a dynamic weight matrix:
(Skippable) The values (inner products) of this matrix are contrastive similarity scores between each (q ∈ Q, k ∈ K) vector pair.
(Skippable) Furthermore, A consists of n×n inner products: each row A_i of A is the ordered tuple S_i = (<q_i, k_1>, …, <q_i, k_n>), with q_i ∈ Q, k_j ∈ K, and i, j ≤ n.
Crucially, the rows of softmax(A) are the coefficients of a convex combination of the rows of the value matrix V when taking softmax(A)V, which computes the output (matrix) of a self-attention layer; this matrix multiplication plays a different role from the one used to compute the similarity matrix A.
Note: softmax(A) here normalizes the entries of A row-wise, not column-wise or over the whole matrix (a minimal sketch of the full computation follows below).
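For concreteness, a minimal single-head self-attention sketch of the computation just described (it omits the 1/√d_k scaling, masking, and the learned projections that produce Q, K, V in a real transformer):

```python
import numpy as np

def softmax_rows(A):
    # row-wise normalization, as in the note above
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention(Q, K, V):
    A = Q @ K.T              # n x n similarity scores <q_i, k_j>, computed from the input itself
    W = softmax_rows(A)      # each row sums to 1: convex-combination coefficients
    return W @ V             # each output row is a convex combination of the rows of V

rng = np.random.default_rng(0)
n, d = 5, 8                  # 5 tokens of dimension 8 (arbitrary, for illustration)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = self_attention(Q, K, V)   # shape (5, 8); the "weights" W changed with the input
```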
3. Onto the counterargument:
Given what was described in (2), I mainly argue that softmax(A = QK^T)V is a very different computation from the kind that fully connected neural networks perform, and from what you may have envisioned.
We specifically see that:
Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate “dynamic” weights that are then applied to any activations.
It is possible that the transformer conditions the self-attention matrix A_l at layer l specifically so that the downstream layers l+1, …, L (given L self-attention layers) are more likely to produce the correct token embeddings, since we only give the transformer L layers to compute the final result.
Regardless of whether the above is happening, the transformer is being “guided” to do implicit meta-learning as part of its pre-training, because:
(a) It’s conditioning its weight matrices (A_l) on the given context (X_l) to maximize the probability of the correct autoregressive output in X_L, in a different manner from learning an ordinary, hierarchical representation upstream.
(b) As it improves the conditioning described in (a) during pre-training, it gets closer to optimal performance on some downstream, unseen task (via 0-shot learning). This is assumed to be true on an evidential basis.
(c) I argue that such zero-shot learning on an unseen task T requires online-learning on that task, which is being described in the given context.
(d) Further, I argue that this online learning improves sample efficiency when gradient updates are later performed on an unseen task T, because the model approximately recognizes a similar task T’ from the information in the context (X_1). Sample efficiency improves because the training loss on T can be determined by the model’s few-shot performance on T (which is related to few-shot accuracy), and because training-steps-to-convergence is directly related to training loss.
So, when actually performing such updates, a better few-shot learner will take fewer training steps. Crucially, it improves the sample efficiency of its future training not just in the “prosaic” sense of having improved its held-out test accuracy, but through (a-c), where it “learns to adapt” to an unseen task (somehow).
Unfortunately, I don’t know precisely what is happening in (a-d) that allows systematic meta-learning to occur, which is what would be needed for the key proposition:
First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this [sample inefficiency] makes it difficult or impossible for it to learn or create concepts that humans are not already using.
to be weakened substantially. I do think meta-learning is happening, given the demonstrated few-shot generalization to unseen tasks, though it only looks like that has something to do with the dynamic-weight-matrix behavior suggested by (a-d). However, it is not enough to show that the dynamic-weights mechanism described initially is doing such-and-such contrastive learning, or to show that it is an overhaul of ordinary DNNs and therefore robustly solves the generative objective (even if that were the case). Someone would instead have to demonstrate that transformers are systematically performing meta-learning (hence out-of-distribution and few-shot generalization) on a task T, which I think is worthwhile to investigate given what they have accomplished experimentally.
Granted, I do believe that more closely replicating cortical algorithms is important for efficiently scaling to AGI and for explainability (I’ve read On Intelligence, Surfing Uncertainty, and several of your articles). The question, then, is whether there are multiple viable paths to efficiently-scaled, safe AGI in the sense that we can functionally (though not necessarily explicitly) replicate those algorithms.
However, I do hope to make some justifiable case below for transformers being able to scale in the limit to an AGI-like model (i.e. which was an emphatic “no” from you)
I don’t feel emphatic about it. Well, I have a model in my head and within that model a transformer can’t scale to AGI, and I was describing that model here, but (1) I’m uncertain that that model is the right way to think about things, (2) even if it is, I don’t have high confidence that I’m properly situating transformers within that model†, (3) even if I am, there is a whole universe of ways to take a Transformer architecture and tweak / augment it—like hook it up to a random access memory or tree search or any other data structure or algorithm, or give it more recurrency, or who knows what else—and I haven’t thought through all those possibilities and would not be shocked if somewhere in that space was a way to fill in what I see as the gaps.
† The paper relating Hopfield networks to transformers came out shortly after I posted this, and seems relevant to evaluating my idea that transformer networks are imitating some aspects of probabilistic programming / PGM inference, but I’m not sure, I haven’t really digested it.
the transformer is being “guided” to do implicit meta-learning … I argue that such zero-shot learning on an unseen task T requires online-learning on that task, which is being described in the given context.
I’m confused about how you’re using the terms “online learning” and “meta-learning” here.
I generally understand “online learning” in the sense of this, where you’re editing the model weights during deployment by doing gradient descent steps for each new piece of labeled data you get. If you’re generating text with GPT-3, then there’s no labeled data to update on, and the weights are fixed, so it’s not online learning by definition. I guess you have something else in mind; can you explain?
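To pin down the sense of “online learning” I have in mind, here is a minimal sketch (a linear model with squared loss, purely illustrative): the weights get a gradient step on each new labeled example as it arrives during deployment.

```python
import numpy as np

def online_sgd_step(w, x, y, lr=0.01):
    # one gradient step on the single new labeled example (x, y)
    pred = w @ x
    grad = (pred - y) * x            # gradient of 0.5 * (w·x - y)^2 w.r.t. w
    return w - lr * grad

rng = np.random.default_rng(0)
w = np.zeros(10)                     # model weights, updated continually during deployment
for _ in range(100):                 # stand-in for a stream of incoming labeled data
    x, y = rng.normal(size=10), rng.normal()
    w = online_sgd_step(w, x, y)     # weights change with every new data point
```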
I generally understand “meta-learning” to mean that there’s an inner loop that has a learning algorithm, and then there’s an outer loop with a learning algorithm that impacts the inner loop. I guess you could say that the 96 transformer layers involved in each word-inference is the inner loop. Is it really a learning algorithm though? It doesn’t look like a learning algorithm. I mean, it certainly figures things out over the course of those 96 processing steps, and maybe it “figures something out” in step 20 and still “knows” that thing in step 84. So OK fine, I guess you can claim that there’s “learning” happening within a single GPT-3 word-inference task, even if it’s not the kind of learning that people normally talk about. And then it would be fair to say that the gradient descent training of GPT-3 is meta-learning. Is that what you have in mind? If so, can’t you equally well say that a 96-layer ConvNet training involves “meta-learning”? Sorry if I’m misunderstanding.
Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate “dynamic” weights that are then applied to any activations.
I was confused for a while because “weight” already has a definition: “weights” are the things that you find by gradient descent (a.k.a. “weights & biases” a.k.a. “parameters”). I guess what you’re saying is that I was putting a lot of emphasis on the idea that the algorithm is something like:
Calculate (28th entry of the input) × 8.21 + (12th entry of the input) × 2.47 + … , then do a nonlinear operation, then do lots more calculations like that, etc. etc.
whereas in a transformer there are also things in the calculation that look like “(14th entry of the input) × (92nd entry of the input)”, i.e. multiplying functions of the input by other functions of the input. Is that what you’re saying?
If so, no I was using the term “matrix multiplications and ReLUs” in a more general way that didn’t exclude the possibility of having functions-of-the-input be part of each of two matrices that then get multiplied together. I hadn’t thought about that as being a distinguishing feature of transformers until now. I suppose that does seem important for widening the variety of calculations that are possible, but I’m not quite following your argument that this is related to meta-learning. Maybe this is related to the previous section, because you’re (re)defining “weights” as sorta “the entries of the matrices that you’re multiplying by”, and also thinking of “weights” as “when weights are modified, that’s learning”, and therefore transformers are “learning” within a single word-inference task in a way that ConvNets aren’t. Is that right? If so, I dunno, it seems like the argument is leaning too much on mixing up those two definitions of “weights”, and you need some extra argument that “the entries of the matrices that you’re multiplying by” have that special status such that if they’re functions of the inputs then it’s “learning” and if they aren’t then it isn’t. Again, sorry if I’m misunderstanding.
Standard feedforward DNNs encompass “circuits buildable from matmul and RELU”; however, crucially, the backprop gradient update necessarily includes another key operator: matrix transpose.
Transformers are a semi-special case of attention/memory-augmented networks, which encompass “circuits buildable from matmul, RELU, and transpose”; thus they incorporate dynamic multiplicative interactions that bring (at least) the ability to learn (or quickly memorize) into the forward pass.
So yes, adding that transpose into the forward/inference pass greatly expands the space of circuits you can efficiently emulate. It’s not obvious how many more such fundamental ops one needs for AGI. Brain circuits don’t obviously have many more key functional components beyond matmul, RELU, multiplicative gating interactions, and efficient sparsity. (Brains also feature many other oddities, like multiplicative/exponential updates vs. linear ones and various related non-negative constraints, but it’s unclear how important those are.)
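A rough numpy illustration of the distinction being drawn here (one reading of it, not a proof): in a standard dense layer the transpose shows up in the backward pass, whereas attention applies a transpose to an input-derived matrix in the forward pass itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Dense layer y = W x: the transpose appears in backprop, not in the forward pass.
W = rng.normal(size=(4, 8))
x = rng.normal(size=8)
dL_dy = rng.normal(size=4)           # placeholder upstream gradient, for illustration
dL_dW = np.outer(dL_dy, x)           # weight gradient = dL_dy · x^T
dL_dx = W.T @ dL_dy                  # W^T used to propagate the gradient backward

# (2) Self-attention: a transpose of an input-derived matrix appears during inference.
Q, K = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
scores = Q @ K.T                     # "matmul + transpose" in the forward pass
```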