Neat! Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was ≤ 1), rather than on a sphere (total norm = 1), right? Is initializing weights that way actually a thing people do?
If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training, through this lens of the density of nonlinearities?
I’m reminded of a recent post about LayerNorm. LayerNorm seems like it squeezes the function back down closer to the unit interval, increasing the density of nonlinearities.
Thanks Charlie.

> Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was ≤ 1), rather than on a sphere (total norm = 1), right?
Yes, exactly (though ≤ T for some constant T, which may not be 1 but turns out not to matter).
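For anyone who wants to poke at the difference, here is a minimal numpy sketch (my own illustration, not from the post) of the two sampling schemes, uniform on the sphere vs uniform in the ball, for a generic radius:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_on_sphere(n, radius=1.0):
    """Uniform on the sphere of the given radius: normalise a standard Gaussian."""
    v = rng.standard_normal(n)
    return radius * v / np.linalg.norm(v)

def sample_in_ball(n, radius=1.0):
    """Uniform in the ball: a sphere direction times a radius drawn with density ∝ r^(n-1)."""
    r = radius * rng.uniform() ** (1.0 / n)  # inverse-CDF trick for the radial part
    return r * sample_on_sphere(n)

print(np.linalg.norm(sample_on_sphere(100)))  # exactly the radius (up to float error)
print(np.linalg.norm(sample_in_ball(100)))    # ≤ the radius, but typically close to it
```

One wrinkle worth noting: in high dimensions the ball sample's norm concentrates near the surface anyway, so the two distributions are closer than they might sound.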
> Is initializing weights that way actually a thing people do?
Not sure (I would like to know). But what I had in mind was initialising a network with small weights, then doing a random walk (‘undirected SGD’), and then looking at the resulting distribution. Of course this will be more complicated than the distributions I use above, but I think the shape may depend quite a bit on the details of the SGD. For example, I suspect that the result of something like adaptive gradient descent may tend towards more spherical distributions, but I haven’t thought about this carefully.
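To make the 'undirected SGD' picture concrete, here is a toy sketch (my construction; the dimension, step count, and step size are arbitrary) of a small initialisation followed by an isotropic random walk:

```python
import numpy as np

rng = np.random.default_rng(1)

n, steps, step_size = 1_000, 10_000, 1e-3
w = 0.01 * rng.standard_normal(n)            # small initialisation
for _ in range(steps):
    w += step_size * rng.standard_normal(n)  # one 'undirected SGD' step

# An isotropic walk like this ends up as a spherical Gaussian around the
# initialisation; a directed or adaptive update rule would reshape that.
print(np.linalg.norm(w))
```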
> If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training, through this lens of the density of nonlinearities?
I hope so! I would want to understand what norm the movements are ‘small’ in (L2, L∞, …).
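Given checkpoints before and after training, measuring 'small' in several norms at once is straightforward; a hypothetical helper (the theta names are mine):

```python
import numpy as np

def parameter_movement(theta_init, theta_final):
    """Distance between two parameter snapshots, in a few different norms."""
    d = theta_final - theta_init
    return {
        "L2": np.linalg.norm(d),
        "L_inf": np.linalg.norm(d, ord=np.inf),
        "L2 relative": np.linalg.norm(d) / np.linalg.norm(theta_init),
    }
```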
LayerNorm looks interesting, I’ll take a look.
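For what it's worth, the squeezing effect Charlie describes is easy to see numerically; here is a bare-bones LayerNorm (learned scale and shift omitted) in numpy:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (learned affine omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(2)
x = 10.0 * rng.standard_normal((4, 512))  # activations that have drifted far from O(1)
y = layer_norm(x)
print(x.std(), y.std())  # ~10 before, ~1 after: back near the scale where the nonlinearities bend
```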