I’m confused by your notation for feed-forward layers. What justifies re-using the same labels (“apple” etc.) for
(1) the coordinates of x, and
(2) the coordinates of x⋅A, i.e. the basis in which the nonlinearity operates?
If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by A, or which vectors/semes they get mapped to by B.
But your labels don’t correspond to either of these interpretations. Instead, it looks like you are following rules of the form “the 4th component of every basis is called ‘yum’,” which leads you to label a coordinate “yum” even though it’s neither mapped from “yum” by A, nor mapped to “yum” by B.
This notation also seems to require the basis (2) to have the same number of elements as (1), which generally will not be the case. In transformers, (2) is typically larger by a factor of 4. The logic of your example, meanwhile, can be expressed using a smaller nonlinearity basis of 3 elements:
neuron1=ReLU(cherry+durian−1)
neuron2=ReLU(apple+banana−1)
neuron3=ReLU(apple+banana)
yum=neuron3−neuron2
yuck=−1∗neuron1
with some arbitrary choices about which multiplicative constants to absorb into A and a vs. which to absorb into B.
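For concreteness, here is a minimal NumPy sketch of that 3-element version. The seme ordering and the exact entries of A, a, and B below are my own illustrative choices (one arbitrary way of absorbing the constants), not anything from the post; it just checks that the three neurons plus the read-out matrix reproduce the yum/yuck formulas above.

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

# Hypothetical seme ordering for the input/output coordinates.
semes = ["apple", "banana", "cherry", "durian", "yum", "yuck"]
x = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # "apple" and "banana" present

# A and a implement the three neurons:
#   neuron1 = ReLU(cherry + durian - 1)
#   neuron2 = ReLU(apple + banana - 1)
#   neuron3 = ReLU(apple + banana)
A = np.array([
    [0.0, 1.0, 1.0],  # apple
    [0.0, 1.0, 1.0],  # banana
    [1.0, 0.0, 0.0],  # cherry
    [1.0, 0.0, 0.0],  # durian
    [0.0, 0.0, 0.0],  # yum (unused on the way in)
    [0.0, 0.0, 0.0],  # yuck (unused on the way in)
])
a = np.array([-1.0, -1.0, 0.0])

# B reads the neurons back out: yum = neuron3 - neuron2, yuck = -1 * neuron1.
B = np.zeros((3, 6))
B[2, semes.index("yum")] = 1.0
B[1, semes.index("yum")] = -1.0
B[0, semes.index("yuck")] = -1.0

neurons = relu(x @ A + a)
out = neurons @ B
print(dict(zip(semes, out.round(2))))

Under these choices yum comes out as min(apple + banana, 1), i.e. an OR of apple and banana, and yuck comes out as −1 exactly when both cherry and durian are present.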
Thanks for your comments/questions; they’re very insightful.
In general, there are as many encoding spaces in a Transformer as there are computational nodes, and a traditional Transformer will have little incentive to use the same semantics for any two of those spaces. (There’s a little bit of an incentive because of the residual connections, which will (I think?) tend to tie together the semantics of the various hidden-size-sized embedding spaces.)
In particular, the middle layer of the dense-ReLU-dense feed-forward block is usually chosen to be significantly larger (4x) than the hidden size, so it’s not even theoretically possible to represent it in the same basis. I’ve found that it sometimes makes sense to use anonymous seme names like x1, x2, x3, etc. in the feed-forward layer for this reason. In my experience so far, the feed-forward layers have been most useful for conjunctions and disjunctions, and there are quadratically many possible conjunctions and disjunctions even of pairs of neurons, let alone of 3 or 4. So it seems to me that this might give a tiny hint as to why people have found that the intermediate embedding space of the feed-forward layer needs to be so large.
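To spell out what I mean by conjunctions and disjunctions, here is a toy example of my own (assuming the relevant features are roughly 0/1-valued): a single ReLU neuron with a −1 bias behaves like an AND, and the output matrix can combine two such neurons into an OR.

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def AND(p, q):
    # Fires only if both features are present.
    return relu(p + q - 1.0)

def OR(p, q):
    # Two neurons combined by the output matrix: subtracting the AND term
    # saturates the sum back to 1, giving an OR for 0/1 inputs.
    return relu(p + q) - relu(p + q - 1.0)

for p in (0.0, 1.0):
    for q in (0.0, 1.0):
        print(p, q, "AND:", AND(p, q), "OR:", OR(p, q))

Each such gate costs only one or two neurons, but with n candidate features there are on the order of n² pairs one might want to cover.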
Of course, there is a potentially huge gap between what I am clever enough to think of as a use for them and what good old gradient descent is clever enough to think of. We can easily lower-bound the potential uses of a component; upper-bounding its capabilities will prove much more challenging.