The Lipschitz constant of a function gives an indication of how horizontal it is rather than how locally linear it is. Naively I’d expect that the second of those things matters more than the first. Has anyone looked at what batch normalization does to that?
More specifically: Define the 2-Lipschitz constant of a function $f$ at $x$ to be something like $\inf\{a_2 : \exists a_0, a_1 \text{ such that } \|f(x+u) - (a_0 + a_1 \cdot u)\| \le \tfrac{1}{2} a_2 \|u\|^2 \text{ for all } u\}$, and its overall 2-Lipschitz constant to be the sup of these over $x$. This measures how well $f$ is locally approximable by linear functions. (I expect someone’s already defined a better version of this, probably with a different name, but I think this’ll do.) Does batch normalization tend to reduce the 2-Lipschitz constant of the loss function?
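To make "locally approximable by linear functions" slightly more concrete (this is just the standard Taylor estimate, not anything from the paper): if a scalar-valued $f$ is twice continuously differentiable with $\|\nabla^2 f\| \le M$ everywhere, then taking $a_0 = f(x)$ and $a_1 = \nabla f(x)$ gives

$$\big|f(x+u) - f(x) - \nabla f(x) \cdot u\big| \;\le\; \tfrac{1}{2} M \|u\|^2 \quad \text{for all } u,$$

so $M$ works as an $a_2$ at every $x$, and the 2-Lipschitz constant is controlled by a bound on the Hessian norm. That is what makes the "Lipschitz derivative" reformulation below plausible.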
[EDITED to add:] I think having a finite 2-Lipschitz constant in this sense may be equivalent to having a derivative which is itself a Lipschitz function (and the 2-Lipschitz constant may be that derivative’s Lipschitz constant, or something like that). So maybe a simpler question is: for networks whose activation functions make the loss function differentiable, does batchnorm tend to reduce the Lipschitz constant of its derivative? But given how well rectified linear units work, and that they have a non-differentiable activation function (which will surely make the loss function fail to be 2-Lipschitz in the sense above), I’m now thinking that if anything like this works it will need to be more sophisticated...
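A quick sanity check of that last worry, using the simplest possible toy case (a one-dimensional example, not anything from the paper): take $f(t) = \max(t, 0)$ and look at $x = 0$. The definition would require

$$|\max(u, 0) - a_0 - a_1 u| \;\le\; \tfrac{1}{2} a_2 u^2 \quad \text{for all } u.$$

Setting $u = 0$ forces $a_0 = 0$; letting $u \to 0^+$ then forces $a_1 = 1$, while $u \to 0^-$ forces $a_1 = 0$. So no finite $a_2$ works, i.e. the kink at $0$ really does break the definition, exactly as suggested above.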
The Lipschitz constant of a function gives an indication of how horizontal it is rather than how locally linear it is. Naively I’d expect that the second of those things matters more than the first. Has anyone looked at what batch normalization does to that?
Yeah, in fact I should have been clearer in the post. A very simple way of reducing the Lipschitzness of a function is to scale it by some constant factor. The original paper attempts to show theoretically that batchnorm is doing more than simply scaling; see Theorem 4.2 in the paper and the subsequent observation in Section 4.3.
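To spell out why plain scaling is the uninteresting case (just the elementary observation, nothing from the paper): if $f$ is $L$-Lipschitz and $g = c f$ for a constant $0 < c < 1$, then

$$|g(x) - g(y)| = c\,|f(x) - f(y)| \le c L \|x - y\|, \qquad \|\nabla g(x) - \nabla g(y)\| = c\,\|\nabla f(x) - \nabla f(y)\|,$$

so both the Lipschitz constant of the loss and the Lipschitz constant of its gradient shrink by the same factor $c$, without the shape of the landscape changing at all. Any claimed smoothing effect has to be distinguished from this.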
If you think about it, though, we can already kind of guess that batch normalization isn’t simply scaling the function. That’s because they measured the gradient predictiveness and found that the gradient ended up much closer to the empirically observed change in loss than when batch normalization was not enabled. This gives us evidence that the function is locally linear in the way that you described. (Of course, this can be criticized if you disagree with how they measured gradient predictiveness, which focused on the variability of the gradient minus the actual difference in loss; see Figure 4 in the paper.)
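For concreteness, here is a minimal sketch of the kind of measurement being described, not the paper’s exact protocol: compare the first-order (locally linear) prediction of the loss drop after a gradient step with the drop actually observed, for a small network with and without batchnorm. The tiny synthetic data, architecture, and step size are all illustrative assumptions.

```python
# Rough sketch of a "gradient predictiveness" check (illustrative, not the
# paper's protocol): how far is the actual loss drop after a gradient step
# from the drop predicted by the local linear model, with vs. without BN?
import copy
import torch
import torch.nn as nn

def make_net(use_bn: bool) -> nn.Sequential:
    layers = [nn.Linear(20, 64)]
    if use_bn:
        layers.append(nn.BatchNorm1d(64))
    layers += [nn.ReLU(), nn.Linear(64, 1)]
    return nn.Sequential(*layers)

def linear_prediction_gap(net, x, y, eta=0.1):
    loss_fn = nn.MSELoss()
    loss = loss_fn(net(x), y)
    grads = torch.autograd.grad(loss, list(net.parameters()))
    grad_sq = sum(g.pow(2).sum() for g in grads)

    # Take one plain gradient step on a copy of the network.
    stepped = copy.deepcopy(net)
    with torch.no_grad():
        for p, g in zip(stepped.parameters(), grads):
            p -= eta * g

    actual_drop = loss - loss_fn(stepped(x), y)
    predicted_drop = eta * grad_sq  # first-order Taylor prediction
    return abs((predicted_drop - actual_drop).item())

torch.manual_seed(0)
x, y = torch.randn(256, 20), torch.randn(256, 1)
for use_bn in (False, True):
    gap = linear_prediction_gap(make_net(use_bn), x, y)
    print(f"use_bn={use_bn}: |predicted - actual loss drop| = {gap:.4f}")
```

A single step at a random initialization is of course far weaker evidence than the paper’s measurements over whole training trajectories; the point is only to show what "the gradient predicts the change in loss well" means operationally.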
Does batch normalization tend to reduce the 2-Lipschitz constant of the loss function?
That’s a good question. My guess would be yes, for the reasons I gave above, but I am not in a position to say confidently either way; I would have to think more about the exact way that you have defined it. :)