The example of MLPs learning to multiply binary digits is mostly unrelated. The difference I am talking about (simple linear vs. complex non-linear) is perhaps better illustrated by considering networks with exponential activation functions rather than relu. Relu is, by design, about as close to linear as one can get while still being useful. With exp activation functions, the outputs and gradients are rather obviously non-linear in the weights.
Another example is extreme activation sparsity through k-max hidden layers with low k.
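To make "non-linear in the weights" concrete, here is a minimal sketch (my own toy setup, not something from either comment): perturb the weights of a one-hidden-layer net by a small random step and measure how much of the resulting change in outputs is missed by a first-order Taylor expansion in the weights, once with relu and once with exp as the hidden activation. The sizes, the perturbation scale eps, and the error metric are arbitrary illustrative choices.

# Toy comparison of how linear-in-the-weights a small net stays under a weight perturbation,
# for relu vs exp hidden activations. All sizes and the perturbation scale are arbitrary.
import jax
import jax.numpy as jnp

def make_net(act):
    def f(params, x):
        W1, W2 = params
        return act(x @ W1) @ W2  # one hidden layer, scalar output per example
    return f

key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
d, h = 8, 256
params = (jax.random.normal(k1, (d, h)) / jnp.sqrt(d),
          jax.random.normal(k2, (h,)) / jnp.sqrt(h))
x = jax.random.normal(k3, (64, d))

# a small random weight perturbation, standing in for a short stretch of training
eps = 0.05
dparams = (eps * jax.random.normal(k4, (d, h)),
           eps * jax.random.normal(k5, (h,)))

for name, act in [("relu", jax.nn.relu), ("exp", jnp.exp)]:
    f = make_net(act)
    # first-order Taylor expansion in the weights: f(w0 + dw) ~ f(w0) + df/dw|_{w0} . dw
    y0, dy = jax.jvp(lambda p: f(p, x), (params,), (dparams,))
    y1 = f((params[0] + dparams[0], params[1] + dparams[1]), x)
    # fraction of the output change that the linear-in-weights approximation misses
    rel_err = jnp.linalg.norm(y1 - (y0 + dy)) / jnp.linalg.norm(y1 - y0)
    print(name, float(rel_err))

The printed relative errors give a rough sense of how much curvature in weight space each activation introduces over a perturbation of that size; sweeping eps makes the comparison more informative than any single number.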
MLPs can learn to multiply general real numbers, not just binary digits, so long as the inputs are bounded. I'm actually not clear on why that example is mostly unrelated. It illustrates that you can have an arbitrary nonlinear circuit in part of the network while still being effectively linear in the weights, because the weights stay in a small neighborhood of initialization. It's not at all obvious to me that exponential activation functions would ruin this property. In fact I suspect they don't in the infinite-width limit, although that limit might be a worse approximation in practice.
Note that the question is ultimately not whether the network is truly linear in the weights, but whether it's effectively linear in the weights over the range they move in. A smooth nonlinear function can usefully be treated as linear if we restrict ourselves to a small enough neighborhood. What's not obvious to me is whether this approximation holds for transformers; I wouldn't be surprised either way.
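For concreteness, here is a sketch of the kind of check I have in mind (again my own toy setup, with arbitrary sizes and hyperparameters): train a small relu MLP to multiply two reals in [-1, 1], then compare the trained network against its tangent model, i.e. the first-order Taylor expansion in the weights around initialization. Whether the tangent model tracks the trained net is exactly the empirical question; the printout just makes it concrete for this toy case.

# (1) train a small relu MLP to multiply two bounded reals, then
# (2) compare the trained net to its tangent (linearized-in-the-weights) model around init.
# Sizes, learning rate, and step count are arbitrary toy choices.
import jax
import jax.numpy as jnp

def init(key, d_in=2, h=512):
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (d_in, h)) * jnp.sqrt(2.0 / d_in),
        "b1": jnp.zeros(h),
        "W2": jax.random.normal(k2, (h, 1)) * jnp.sqrt(1.0 / h),
        "b2": jnp.zeros(1),
    }

def forward(params, x):
    hid = jax.nn.relu(x @ params["W1"] + params["b1"])
    return (hid @ params["W2"] + params["b2"]).squeeze(-1)

def loss(params, x, y):
    return jnp.mean((forward(params, x) - y) ** 2)

@jax.jit
def sgd_step(params, x, y, lr=1e-2):
    grads = jax.grad(loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

key = jax.random.PRNGKey(0)
params0 = init(key)
params = params0
for _ in range(10_000):
    key, sub = jax.random.split(key)
    x = jax.random.uniform(sub, (256, 2), minval=-1.0, maxval=1.0)
    params = sgd_step(params, x, x[:, 0] * x[:, 1])

# (1) the trained net should multiply reasonably well on held-out points
x_test = jnp.array([[0.3, -0.7], [0.9, 0.9], [-0.5, 0.2]])
print("net:    ", forward(params, x_test))
print("target: ", x_test[:, 0] * x_test[:, 1])

# (2) tangent model around init: f_lin(w) = f(w0) + df/dw|_{w0} . (w - w0)
dparams = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
y0, dy = jax.jvp(lambda p: forward(p, x_test), (params0,), (dparams,))
print("tangent:", y0 + dy)

The same comparison applies in principle to a transformer, just with the tangent model built around the transformer's initialization; it is only this toy MLP version that is cheap to run.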