However, I don’t really see how you’d easily extend the polytope formulation to activation functions that aren’t piecewise linear, like tanh or logits, while the functional analysis perspective can handle that pretty easily. Your functions just become smoother.
Extending the polytope lens to activation functions such as sigmoids, softmax, or GELU is the subject of a paper by Balestriero & Baraniuk (2018): https://arxiv.org/abs/1810.09274
In the case of GELU and some similar activation functions, you’d need to replace the binary spline-code vectors with vectors whose elements take values in (0, 1).
There’s some further explanation in Appendix C!
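To make the soft-code idea a bit more concrete, here is a rough toy sketch (my own illustration, not the paper's construction; the function names are made up): for a ReLU layer the code records which side of the kink each unit is on, while for GELU(x) = x·Φ(x) one natural soft analogue is the gate Φ(x) itself, which lies in (0, 1).

```python
import numpy as np
from scipy.stats import norm

def relu_code(preacts):
    """Binary spline code for a ReLU layer: 1 if a unit is 'on', 0 if 'off'.
    The layer output is preacts * code, and the code identifies which linear
    region (polytope) the input falls into."""
    return (preacts > 0).astype(float)

def gelu_code(preacts):
    """A soft analogue for GELU(x) = x * Phi(x): the gate Phi(x) lies in (0, 1)
    rather than {0, 1}, so region membership becomes graded instead of binary.
    (One natural choice for illustration; see the paper and Appendix C for the
    formal treatment.)"""
    return norm.cdf(preacts)

preacts = np.array([-2.0, -0.1, 0.1, 2.0])
print(relu_code(preacts))            # [0. 0. 1. 1.]
print(gelu_code(preacts))            # values strictly between 0 and 1
print(preacts * relu_code(preacts))  # equals ReLU(preacts)
print(preacts * gelu_code(preacts))  # equals GELU(preacts)
```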
In the functional analysis view, a “feature” is a description of a set of inputs that makes a particular element in a given layer’s function space take activation values close to their maximum value. E.g., some linear combination of neurons in a layer is most activated by pictures of dog heads.
This, indeed, is the assumption we wish to relax.
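For readers less used to that view, here's a toy sketch of what "a direction in a layer that some set of inputs maximally activates" looks like operationally: plain activation maximization by projected gradient ascent on the input. The layer, direction, and hyperparameters are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy layer: activations(x) = ReLU(W @ x + b)
W = rng.normal(size=(32, 64))
b = rng.normal(size=32)

# A candidate "feature": a fixed direction d in the layer's activation space.
d = rng.normal(size=32)
d /= np.linalg.norm(d)

def feature_value(x):
    return d @ np.maximum(W @ x + b, 0.0)

# Activation maximization: gradient ascent on x, constrained to the unit sphere,
# to find an input that makes this direction fire near its maximum.
x = rng.normal(size=64)
x /= np.linalg.norm(x)
for _ in range(200):
    pre = W @ x + b
    grad = W.T @ (d * (pre > 0))  # d/dx of d . ReLU(W x + b)
    x = x + 0.1 * grad
    x /= np.linalg.norm(x)        # project back onto the unit sphere

print(feature_value(x))  # typically far larger than for a random unit-norm input
```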
But there’s a lot more to know about a function $f$ than what $\max\{f(x) \mid x \in X\}$ is.
Agreed!
Scaling up some of the activations in a layer by a constant factor means you’re increasing the norm of the corresponding functions, changing the principal component basis of the layer’s function space. So it shouldn’t be surprising if subsequent layers get messed up by that.
There are many lenses that let us see how unsurprising this experiment was, and this is another one! We only use this experiment to show that it’s surprising if you view features as directions without qualifying that view with a distribution of activation magnitudes over which a direction’s semantics remain valid (called a ‘distribution of validity’ in this post).
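As a toy illustration of why the polytope lens also makes the result unsurprising (random weights, purely a sketch): scaling one layer's activations by a constant moves the activation vector across the next layer's polytope boundaries, because the downstream biases don't scale with it, so the downstream activation pattern (and potentially the predicted class) can change, not just the output magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random net: x -> ReLU(W1 x + b1) -> ReLU(W2 h1 + b2) -> W3 h2 + b3
W1, b1 = rng.normal(size=(16, 8)),  rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)
W3, b3 = rng.normal(size=(4, 16)),  rng.normal(size=4)

def forward(x, scale=1.0):
    h1 = np.maximum(W1 @ x + b1, 0.0)
    h1 = scale * h1  # scale the first layer's activations by a constant factor
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return h2, W3 @ h2 + b3

x = rng.normal(size=8)
for scale in [1.0, 2.0, 10.0]:
    h2, logits = forward(x, scale)
    pattern = (h2 > 0).astype(int)  # which layer-2 units are on: the downstream polytope
    print(scale, pattern, np.argmax(logits))
# Because the biases b2, b3 do not scale along with h1, larger scales typically flip
# some layer-2 units on or off (a different polytope), and the argmax can change too.
```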
Thanks for your comment!