Hmm, I think I understand what you’re pointing at, but it’s not obvious to me that the conclusion is correct. If I wear my “infinite hidden width extremist” hat, I’d say that the network after training has hidden activations on input x that are extremely similar to those of the network before training. It’s just that the hidden activations have moved in a coordinated way so as to make the output layer come out very differently.
So yeah, the nonlinearities are all there, but they’re fixed nonlinearities of hidden features, and the network’s job is to learn the right linear combination of those fixed nonlinear features.
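One way to write down the picture above is as a first-order Taylor expansion in the weights. This is only a sketch in notation that is not from the thread: f is the network output, \theta_0 the weights at initialization, and \theta the weights after training.

$$
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top} (\theta - \theta_0)
$$

Read this way, \nabla_\theta f(x;\theta_0) plays the role of the fixed nonlinear features of x, and \theta - \theta_0 is the linear combination that training has to find. The approximation is only as good as the assumption that \theta stays close to \theta_0.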
I’m not confident that this will hold in transformer networks, but I’m not confident it won’t either. Keep in mind that MLPs can learn to multiply, but (if sufficiently wide) they’re still effectively linear models. So the mere existence of nonlinear, multiplicative interactions as a function of the input doesn’t guarantee nonlinearity in the weights.
MLPs learning to multiply binary digits is mostly unrelated. The difference I am talking about (simple linear vs. complex non-linear) is perhaps better illustrated by considering networks with exponential activation functions rather than ReLU. ReLU is by design about as close to linear as one can get while still being useful. With exp activation functions, the outputs and gradients are rather obviously non-linear in the weights.
Another example is extreme activation sparsity through k-max hidden layers with low k.
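For readers unfamiliar with the term, here is a minimal sketch of what a k-max hidden layer could look like, assuming it means "keep the k largest activations and zero the rest". The function name `k_max` and the tie-breaking rule are illustrative choices, not something specified in the thread.

```python
def k_max(activations, k):
    """Keep the k largest activations and set all others to zero."""
    if k <= 0:
        return [0.0 for _ in activations]
    # Indices ordered from largest to smallest activation.
    # Ties are broken by position, which is an arbitrary choice for this sketch.
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i],
                   reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


hidden = [0.3, 2.0, -1.0, 1.5, 0.1]
print(k_max(hidden, 2))  # only the two largest units stay active
```

Which units survive depends on the ordering of the pre-activations, so a small change in the weights can switch a unit fully on or off. That is the sense in which low k looks far from linear.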
MLPs can learn to multiply general real numbers, not just binary digits, so long as the inputs are bounded. I’m actually not clear on why that example is mostly unrelated. It illustrates that you can have an arbitrary nonlinear circuit in part of the network while still being effectively linear in terms of weights, due to the weights staying in a small neighborhood of initialization. It’s actually not at all obvious to me that exponential activation functions would ruin this property. In fact I suspect they don’t in the infinite width limit, although that infinite width limit might be a worse approximation in practice.
Note that the question is ultimately not whether the network is truly linear in weights, but whether it’s effectively linear in weights over the range they move in. A nonlinear smooth function can be usefully treated as linear if we constrain ourselves to a small enough neighborhood. What’s not obvious to me is whether this approximation works for transformers. I wouldn’t be surprised either way.
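The "small enough neighborhood" point can be checked numerically on a toy model. The sketch below is not from the thread and makes several assumptions: a one-hidden-layer network with tanh units, a single scalar input, and a random perturbation direction for the hidden weights. It compares the true output after moving the weights with the first-order approximation around the starting weights.

```python
import math
import random


def f(w, a, x):
    """Output of a tiny one-hidden-layer net: sum_j a_j * tanh(w_j * x)."""
    return sum(aj * math.tanh(wj * x) for wj, aj in zip(w, a))


def grad_w(w, a, x):
    """Gradient of f with respect to the hidden weights w."""
    return [aj * x * (1.0 - math.tanh(wj * x) ** 2) for wj, aj in zip(w, a)]


def linearization_error(w0, a, x, direction, scale):
    """Absolute gap between f at w0 + scale * direction and its
    first-order Taylor approximation around w0."""
    w = [w0j + scale * dj for w0j, dj in zip(w0, direction)]
    g = grad_w(w0, a, x)
    f_lin = f(w0, a, x) + sum(gj * scale * dj for gj, dj in zip(g, direction))
    return abs(f(w, a, x) - f_lin)


random.seed(0)
n = 50  # hidden width (arbitrary for this sketch)
w0 = [random.gauss(0, 1) for _ in range(n)]
a = [random.gauss(0, 1) / math.sqrt(n) for _ in range(n)]
d = [random.gauss(0, 1) for _ in range(n)]
x = 0.7

for scale in (1.0, 0.1, 0.01):
    print(scale, linearization_error(w0, a, x, d, scale))
```

The function is nonlinear in w everywhere, yet the gap to the linear approximation shrinks quickly as the step gets smaller. Whether trained weights in a real transformer stay within such a neighborhood is exactly the open question; this toy says nothing about that.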