Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was ≤ 1), rather than on a sphere (total norm == 1), right?
Yes, exactly (though ≤ T for some constant T, which may not be 1, but that turns out not to matter).
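In case it's useful, here's a minimal numpy sketch of the two sampling schemes (uniform on a sphere vs. uniform in a ball of radius T); the dimension and radius values are just placeholders:

```python
import numpy as np

def sample_sphere(dim, radius=1.0, rng=None):
    """Uniform sample on the sphere of the given radius (norm == radius)."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(dim)
    return radius * x / np.linalg.norm(x)

def sample_ball(dim, radius=1.0, rng=None):
    """Uniform sample in the L2 ball of the given radius (norm <= radius)."""
    if rng is None:
        rng = np.random.default_rng()
    direction = sample_sphere(dim, 1.0, rng)
    # For the point to be uniform in volume, the radius is drawn as
    # r = radius * u**(1/dim) with u uniform on [0, 1].
    r = radius * rng.uniform() ** (1.0 / dim)
    return r * direction

# In high dimensions almost all of a ball's volume sits near its surface,
# so u**(1/dim) is close to 1 and the two samples have very similar norms.
rng = np.random.default_rng(0)
print(np.linalg.norm(sample_ball(1000, rng=rng)),
      np.linalg.norm(sample_sphere(1000, rng=rng)))
```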
Is initializing weights that way actually a thing people do?
Not sure (I would like to know). But what I had in mind was initialising a network with small weights, then doing a random walk (‘undirected SGD’), and then looking at the resulting distribution. Of course this will be more complicated than the distributions I use above, but I think the shape may depend quite a bit on the details of the SGD. For example, I suspect that the result of something like adaptive gradient descent may tend towards more spherical distributions, but I haven’t thought about this carefully.
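For concreteness, here's a rough sketch of the kind of random walk I have in mind, under the (strong) simplifying assumption that the steps are isotropic Gaussian with a fixed step size; real SGD noise would of course be structured quite differently:

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 1000          # total number of parameters
n_steps = 10_000    # number of random-walk steps
step_size = 1e-3    # per-step perturbation scale

# Start near the origin ("small initialisation"), then take isotropic
# Gaussian steps in parameter space ("undirected SGD").
w = 1e-2 * rng.standard_normal(dim)
norms = []
for _ in range(n_steps):
    w += step_size * rng.standard_normal(dim)
    norms.append(np.linalg.norm(w))

# For an unconstrained walk the expected squared norm grows roughly like
# dim * step_size**2 * n_steps, so the walk drifts steadily outwards.
print(norms[-1], np.sqrt(dim * step_size**2 * n_steps))
```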
If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training in this lens of looking at the density of nonlinearities?
I hope so! I would want to understand what norm the movements are ‘small’ in (L2, L∞, …).
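As a toy illustration of why the choice of norm matters, here's a small sketch comparing L2 and L∞ distances between an 'initial' and 'final' parameter vector; the numbers are made up, just to show that coordinate-wise movements can all be tiny while the L2 distance is still appreciable in high dimensions:

```python
import numpy as np

def movement_norms(w_init, w_final):
    """Compare how far 'training' moved the parameters under different norms."""
    delta = w_final - w_init
    return {
        "L2": np.linalg.norm(delta),
        "Linf": np.max(np.abs(delta)),
        "relative L2": np.linalg.norm(delta) / np.linalg.norm(w_init),
    }

rng = np.random.default_rng(0)
w0 = rng.standard_normal(100_000)
w1 = w0 + 1e-3 * rng.standard_normal(100_000)  # many tiny per-coordinate changes
print(movement_norms(w0, w1))
```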
Thanks Charlie.
LayerNorm looks interesting, I’ll take a look.