The Lipschitz constant of a function indicates how close to horizontal it is (how bounded its slope is) rather than how locally linear it is. Naively I’d expect that the second of those things matters more than the first. Has anyone looked at what batch normalization does to that?
Yeah, in fact I should have been clearer in the post. A very simple way of reducing the Lipschitz constant of a function is just to scale it by a constant factor less than one. The original paper attempts to show theoretically that batchnorm is doing more than simple scaling; see theorem 4.2 in the paper and the subsequent observation in section 4.3.
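To make the scaling point concrete (this is just the standard definition, writing $f$ for the loss, $c$ for the scale factor, and $L$ for the Lipschitz constant): if $f$ is $L$-Lipschitz and $g = c f$ with $0 < c < 1$, then for any $x, y$

$$|g(x) - g(y)| = c\,|f(x) - f(y)| \le c L\,\lVert x - y \rVert,$$

so $g$ is $cL$-Lipschitz. The constant shrinks even though the shape of the loss landscape hasn't changed at all, which is why a smaller Lipschitz constant on its own doesn't tell you that optimization got easier.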
If you think about it, though, we can already guess that batch normalization isn’t simply scaling the function. That’s because the gradient-predictiveness measurements showed that the gradient’s first-order prediction ended up much closer to the empirically observed change in loss than when batch normalization was not enabled. This gives us evidence that the function is locally linear in the way that you described. (Of course, this can be criticized if you disagree with how they measured gradient predictiveness, which focused on the variability of the gap between the gradient’s prediction and the actual difference in loss; see figure 4 in the paper.)
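For what it’s worth, here is a minimal sketch of what “gradient predictiveness” means operationally. This is my own toy reconstruction, not the paper’s exact protocol: the toy loss, the numerical gradient, and the step sizes are all made up for illustration. The idea is to compare the gradient’s first-order prediction of the change in loss against the change you actually observe after taking the step.

```python
# Sketch (not the paper's exact protocol): estimate how predictive a gradient is
# by comparing its first-order prediction with the actual change in loss.
import numpy as np

def loss(w):
    # Toy non-linear loss standing in for a network's training loss.
    return np.sum(np.sin(w) ** 2) + 0.1 * np.sum(w ** 2)

def grad(w, eps=1e-6):
    # Numerical (central-difference) gradient of the toy loss.
    g = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2 * eps)
    return g

def predictiveness_gaps(w, step_sizes):
    # For each step size, compare the predicted vs. actual change in loss
    # when moving along the negative gradient.
    g = grad(w)
    gaps = []
    for eta in step_sizes:
        delta_w = -eta * g
        predicted = g @ delta_w               # first-order (linear) prediction
        actual = loss(w + delta_w) - loss(w)  # what the loss actually does
        gaps.append(abs(actual - predicted))
    return np.array(gaps)

rng = np.random.default_rng(0)
w0 = rng.normal(size=20)
gaps = predictiveness_gaps(w0, step_sizes=np.linspace(0.05, 0.5, 10))
print("mean |actual - predicted| change in loss:", gaps.mean())
# Smaller and less variable gaps mean the gradient is a better local predictor,
# i.e. the loss is closer to locally linear along the gradient direction.
```

As I read figure 4, the analogous quantity there is tracked along the actual training trajectory, with and without batchnorm, and the with-batchnorm runs show a much tighter spread.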
Does batch normalization tend to reduce the 2-Lipschitz constant of the loss function?
That’s a good question. My guess would be yes, for the reasons I gave above, but I am not in a position to say confidently either way; I would have to think more about the exact way that you have defined it. :)