I’m trying to read through this more carefully this time: how load-bearing is the use of ReLU nonlinearities in the proof? This doesn’t intuitively seem like it should be that important (e.g. a sigmoid/gelu/tanh network feels like it is probably singular, and it certainly has to be if SLT is going to tell us something important about NN behaviour, because changing the nonlinearity doesn’t change how NNs behave that much imo), but it does seem to be an important part of the construction you use.
Good question! The proof of the exact symmetries of this setup, i.e. the precise form of $W_0$, is highly dependent on the ReLU. However, the general phenomenon I am discussing applies well beyond ReLU to other non-linearities. I think there are two main components to this:
Other non-linearities induce singular models. As you note, other non-linear activation functions do lead to singular models. @mfar did some great work on this for tanh networks. Even though the activation function matters, the better intuition is that the hierarchical nature of a model (e.g. a neural network) is what makes it singular. Deep linear networks are still singular despite having an identity activation function. Think of the activation as giving the model more expressiveness.
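To make the deep-linear point concrete, here is a minimal numpy sketch (my own toy construction, not from the original post): a two-layer linear network $x \mapsto ABx$ can be reparametrised by any invertible matrix $G$ slotted between the layers without changing the function it computes, so whole manifolds of weights are functionally equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # second-layer weights
B = rng.normal(size=(4, 5))   # first-layer weights

# Slot any invertible matrix G between the layers: (A G, G^{-1} B) is a
# different point in parameter space that computes the identical map x -> A B x.
G = rng.normal(size=(4, 4))               # invertible with probability 1
A2, B2 = A @ G, np.linalg.inv(G) @ B

x = rng.normal(size=5)
print(np.allclose(A @ (B @ x), A2 @ (B2 @ x)))  # True: same function, different weights
```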
Even if $W_0$ is uninteresting, the loss landscape might be “nearly singular”. The ReLU has an analytic approximation, the Swish function $\sigma_\beta(x) = \frac{x}{1 + e^{-\beta x}}$, where $\lim_{\beta \to \infty} \sigma_\beta(x) = \mathrm{ReLU}(x)$, which does not yield the same symmetries as discussed in this post. This is because the activation boundaries are no longer a sensible thing to study (the swish function is “always active” in all subsets of the input domain), which breaks down a lot of the analysis used here.
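As a quick sanity check (my own sketch, not part of the original argument), you can verify numerically that $\sigma_\beta$ converges to ReLU pointwise as $\beta$ grows; the toy script below just measures the maximum gap on a grid.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

def swish(x, beta):
    # sigma_beta(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x * expit(beta * x)

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-5.0, 5.0, 1001)
for beta in [1, 10, 100, 1000]:
    gap = np.max(np.abs(swish(x, beta) - relu(x)))
    print(f"beta = {beta:5d}   max |swish - relu| on the grid = {gap:.2e}")
```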
Suppose, however, that we take a $\beta_0$ that is so large that, from the point of view of your computer, $\sigma_{\beta_0}(x) = \mathrm{ReLU}(x)$ (i.e. their difference is within machine epsilon). Even though $W_0^{\mathrm{swish}}$ is now a very different object to $W_0^{\mathrm{ReLU}}$ on paper, the loss landscapes will be approximately equal, $L_{\mathrm{swish}}(w) \approx L_{\mathrm{ReLU}}(w)$, meaning that the Bayesian posterior will be practically identical between the two functions and induce the same training dynamics.
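Here is a toy illustration of that claim (my own construction; the network, data, and $\beta$ values are arbitrary): evaluate the same randomly-initialised one-hidden-layer network on the same data with ReLU and with swish at increasing $\beta$, at identical weights, and watch the loss difference shrink.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))                     # toy inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=256)  # toy targets

# one random set of weights for a 2-16-1 network, shared by both activations
W1, b1 = rng.normal(size=(2, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=16), rng.normal()

def mse(activation):
    h = activation(X @ W1 + b1)   # hidden layer
    pred = h @ W2 + b2            # scalar output per example
    return np.mean((pred - y) ** 2)

loss_relu = mse(lambda z: np.maximum(z, 0.0))
for beta in [1e1, 1e3, 1e5, 1e7]:
    loss_swish = mse(lambda z: z * expit(beta * z))
    print(f"beta = {beta:.0e}   |L_swish - L_relu| = {abs(loss_swish - loss_relu):.2e}")
```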
So, whilst the precise functional equivalences might be very different across activation functions (differing $W_0$), there might be many approximate functional equivalences. This is also the sense in which we can wave our arms about “well, SLT only applies to analytic functions, and ReLU isn’t analytic, but who cares”. Making precise mathematical statements about this “nearly singular” phenomenon (for example, how does the posterior change as you lower $\beta$ in $\sigma_\beta(x)$?) is under-explored at present (to the best of my knowledge), but it is certainly not something that discredits SLT, for all of the reasons I have just explained.
Yeah, I agree with everything you say; it’s just that I was trying to remind myself of enough of SLT to give a ‘five minute pitch’ for SLT to other people, and I didn’t like the idea that I’m hanging it off the ReLU.
I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.
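For instance, it’s easy to check numerically (a throwaway sketch of my own, not anything from the post) that permuting the hidden units, together with the matching rows of the output weights, leaves a one-hidden-layer ReLU network’s function unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 8)), rng.normal(size=8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 2)), rng.normal(size=2)   # hidden -> output

def f(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2      # one-hidden-layer ReLU net

perm = rng.permutation(8)   # relabel the hidden units
x = rng.normal(size=(5, 3))
print(np.allclose(f(x, W1, b1, W2, b2),
                  f(x, W1[:, perm], b1[perm], W2[perm, :], b2)))  # True
```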
I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I’m interested in SLT is that I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions about ‘oh well, the posterior is probably-sorta-Gaussian’ played a big role in its longevity as an idea.
Yeah, it’s not totally clear what this ‘nearly singular’ thing would mean. Intuitively, it might be that there’s a kind of ‘hidden singularity’ in the space of this model that might affect the behaviour, like the singularity in a dynamical model with a phase transition. But I’m just guessing.