That was my first thought as well. As far as I know, the most popular simple model used for this in the neuro literature, divisive normalization, uses a similar but not quite identical formula. Different authors use different variations, but it’s something shaped like
$$z_i = \frac{y_i^{\alpha}}{\beta^{\alpha} + \sum_j \kappa_{ij}\, y_j^{\alpha}}$$
where $y_i$ is the unit’s activation before lateral inhibition, $\beta$ adds a shift/bias, $\kappa_{ij}$ are the respective inhibition coefficients, and the exponent $\alpha$ modulates the sharpness of the sigmoid (2 is a typical value). Here’s an interactive desmos plot with just a single self-inhibiting unit. This function is asymmetric in the way you describe, if I understand you correctly, but to my knowledge it’s never gained any popularity outside of its niche. The ML community seems to much prefer Softmax, LayerNorm et al., and I’m curious if anyone knows if there’s a deep technical reason for these different choices.
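For concreteness, here’s a minimal NumPy sketch of that formula (the function name and the defaults $\alpha=2$, $\beta=1$ are my own choices, not from any particular paper):

```python
import numpy as np

def divisive_normalization(y, kappa, alpha=2.0, beta=1.0):
    """Divisive normalization: z_i = y_i^alpha / (beta^alpha + sum_j kappa_ij * y_j^alpha).

    y:     (n,) pre-inhibition activations (assumed non-negative here)
    kappa: (n, n) lateral inhibition coefficients
    alpha: exponent controlling the sharpness of the sigmoid
    beta:  shift/bias term in the denominator
    """
    y_a = y ** alpha
    return y_a / (beta ** alpha + kappa @ y_a)

# Single self-inhibiting unit with kappa = [[1]]: reduces to z = y^2 / (1 + y^2),
# which saturates at 1 as y grows.
for v in np.linspace(0.0, 5.0, 6):
    print(v, divisive_normalization(np.array([v]), np.array([[1.0]]))[0])
```

With a single self-inhibiting unit and $\kappa = 1$ this is presumably the same one-dimensional curve as in the desmos plot.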
I think in feed-forward networks (i.e. ones that don’t re-use the same neuron multiple times), having to learn all the $\kappa_{ij}$ inhibition coefficients is too much to ask. RNNs have gone in and out of fashion, and maybe they could use something like this (maybe scaled down a little), but you could achieve similar inhibition effects with multiple different architectures; LSTMs already have multiplication built into them, but in a different way. There isn’t a particularly deep technical reason for the different choices.
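To make the parameter-count point concrete, here’s a rough PyTorch sketch of what a layer with a fully learnable $\kappa$ matrix might look like (the class name and all defaults are hypothetical, not from any existing library):

```python
import torch
import torch.nn as nn

class DivisiveNorm(nn.Module):
    """Divisive normalization with a fully learnable inhibition matrix.

    For n units this adds n*n kappa coefficients (plus beta), which is the
    parameter-count objection above: a width-1024 layer needs roughly a
    million extra weights just for lateral inhibition.
    """
    def __init__(self, n_units, alpha=2.0):
        super().__init__()
        self.alpha = alpha
        self.kappa = nn.Parameter(torch.eye(n_units))  # n^2 coefficients
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, y):
        # y: (batch, n_units), assumed non-negative (e.g. post-ReLU)
        y_a = y.clamp(min=0) ** self.alpha
        return y_a / (self.beta ** self.alpha + y_a @ self.kappa.T)

layer = DivisiveNorm(1024)
print(sum(p.numel() for p in layer.parameters()))  # ~1.05M parameters
```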