How do we check empirically or otherwise whether this explanation of what batch normalization does is correct?
Great question. I should be giving a partial answer in tomorrow’s post. The bare minimum we can do is check if there’s a way to define internal covariate shift (ICS) rigorously, and then measure how much the technique is reducing it. What Shibani Santurkar et al. found was
Surprisingly, we observe that networks with BatchNorm often exhibit an increase in ICS (cf. Figure 3). This is particularly striking in the case of [deep linear networks]. In fact, in this case, the standard network experiences almost no ICS for the entirety of training, whereas for BatchNorm it appears that G and G′ are almost uncorrelated. We emphasize that this is the case even though BatchNorm networks continue to perform drastically better in terms of attained accuracy and loss.
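To make that measurement concrete, here is a rough sketch (my own PyTorch code, not the paper’s; `model`, `loss_fn`, and `layer_name` are placeholders) of one variant of what they measure: a layer’s gradient G at the current parameters versus the gradient G′ recomputed after the preceding layers have taken their step, with ICS reported as the distance between the two.

```python
import copy
import torch

def ics_for_layer(model, loss_fn, x, y, layer_name, lr=0.1):
    """Sketch of an ICS-style measurement: compare the chosen layer's
    gradient G (current parameters) with G' (after the layers *before*
    it have taken an SGD step)."""
    # G: gradient of the chosen layer at the current parameters.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    G = {n: p.grad.clone() for n, p in model.named_parameters()
         if n.startswith(layer_name)}

    # Update only the parameters registered *before* the chosen layer.
    shifted = copy.deepcopy(model)
    before = True
    with torch.no_grad():
        for (n, p), (_, p_old) in zip(shifted.named_parameters(),
                                      model.named_parameters()):
            if n.startswith(layer_name):
                before = False
            if before and p_old.grad is not None:
                p -= lr * p_old.grad

    # G': the same layer's gradient after that shift in its inputs' distribution.
    shifted.zero_grad()
    loss_fn(shifted(x), y).backward()
    G_prime = {n: p.grad.clone() for n, p in shifted.named_parameters()
               if n.startswith(layer_name)}

    # Report ICS as the l2 distance between the two gradients.
    return sum(float((G[n] - G_prime[n]).norm()) for n in G)
```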
Large internal covariate shift would then mean that if we choose some ε>0, perform some SGD steps to get parameters θ, and look at the loss function’s graph in an ε-neighborhood of θ, it doesn’t really look like a plane; it’s more curvy. And small internal covariate shift would mean that the function’s graph there looks more like a plane, hence gradient descent works better. Is this intuition correct?
Interesting. If I understand your intuition correctly, you are essentially imagining internal covariate shift to be a measure of the smoothness of the loss (and its gradient) around the parameters θ. Is that correct?
In that case, you are in some sense already capturing the intuition (as I understand it) for why batch normalization really works, rather than the reason I gave above. The newer paper puts a narrower spin on this, by saying roughly that the loss and its gradient in that ε-neighborhood of θ have improved Lipschitzness.
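For reference, “improved Lipschitzness” can be made precise with the standard textbook definitions (not the paper’s exact notation): for all θ₁, θ₂ in an ε-neighborhood of θ,

|ℓ(θ₁) − ℓ(θ₂)| ≤ L · ‖θ₁ − θ₂‖   (the loss ℓ is L-Lipschitz), and

‖∇ℓ(θ₁) − ∇ℓ(θ₂)‖ ≤ β · ‖θ₁ − θ₂‖   (the gradient is Lipschitz, i.e. the loss is β-smooth; this β is the smoothness constant, not the BatchNorm shift parameter).

A small smoothness constant is exactly the “looks like a plane near θ” picture: the gradient barely changes as you move around θ, so a gradient step computed at θ stays informative along the whole step.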
Personally, I don’t view internal covariate shift that way. Of course, until it’s rigorously defined (which it certainly can be), there’s no clear interpretation either way.
Why does the internal covariate shift become smaller, even though we have the learned γ and β terms?
This was the part I understood the least, I think. But the way I understand it is that by allowing the model to choose γ and β, it can choose from a variety of distributions while maintaining structure: the pre-activation is first normalized, and its mean and variance are then set by just two learned parameters rather than by the entire stack of layers below. As long as γ and β don’t change too rapidly, I think the idea is that they shouldn’t contribute too heavily towards shifting the distribution in a harmful way.
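To spell out what those two learned parameters do, here is a minimal sketch of the transform from the original paper (per-feature, training-time batch statistics only; `gamma` and `beta` would be learned tensors of shape `[features]`):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization for activations x of shape [batch, features].
    Normalization strips out whatever mean/variance the layers below produced;
    gamma and beta then set the scale and shift directly, via just two
    learned parameters per feature."""
    mu = x.mean(dim=0)                        # batch mean (computed, not learned)
    var = x.var(dim=0, unbiased=False)        # batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)  # ~zero mean, unit variance
    return gamma * x_hat + beta               # learned scale and shift
```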
Can we fix it somehow? What if we make an optimizer that allows only 1 weight to change sign at each iteration?
This is an interesting approach. I’d have to think about it more, and how it interacts with my example. I remember reading somewhere that researchers once tried to only change one layer at a time, but this ended up being too slow.
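For what it’s worth, here is one way your idea could be sketched (purely hypothetical, not something I know from the literature): take an ordinary SGD step, find every weight the step would carry across zero, allow only the flip with the largest gradient magnitude, and stop the rest at zero.

```python
import torch

@torch.no_grad()
def sgd_one_sign_flip(params, lr=0.01):
    """Hypothetical sketch of the proposal: an SGD step in which at most
    one weight is allowed to change sign per iteration."""
    params = [p for p in params if p.grad is not None]
    # The plain-SGD update and the coordinates it would sign-flip.
    proposed = [p - lr * p.grad for p in params]
    flips = [(p.sign() * q.sign()) < 0 for p, q in zip(params, proposed)]

    # Choose the single allowed flip: the one with the largest gradient magnitude.
    best = None  # (param index, flat coordinate, |grad| there)
    for i, (p, f) in enumerate(zip(params, flips)):
        if f.any():
            g = (p.grad.abs() * f).reshape(-1)
            j = int(g.argmax())
            if best is None or float(g[j]) > best[2]:
                best = (i, j, float(g[j]))

    # Apply the update, stopping every disallowed flip at zero.
    for i, (p, q, f) in enumerate(zip(params, proposed, flips)):
        q, f = q.reshape(-1).clone(), f.reshape(-1).clone()
        if best is not None and best[0] == i:
            f[best[1]] = False            # this one flip goes through
        q[f] = 0.0                        # the rest are clamped at zero
        p.copy_(q.reshape(p.shape))
```

Whether this would help at all is an open question; the sketch is just the idea made concrete.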
Does batch normalization really cause the distribution of a neuron’s activations to be more like a Gaussian? Is that an empirical observation of what happens when a neural network with batch normalization is optimized by an SGD-like optimizer?
I will admit to being imprecise in the way I worded that part. I wanted a way of conveying that the transformation is intended to control the shape of the distribution, in order to keep it similar across training steps. A Gaussian is a well-behaved shape, and one that is easy for the layer to end up with as its distribution.
In fact, the original paper responds to this point of yours fairly directly:
In reality, the transformation is not linear, and the normalized values are not guaranteed to be Gaussian nor independent, but we nevertheless expect Batch Normalization to help make gradient propagation better behaved.
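If you wanted to check the Gaussianity question empirically, one rough way (my own sketch, not from either paper; `model`, `bn_layer`, and `x` are placeholders) is to hook a BatchNorm layer, collect its outputs on a batch, and run a normality test per unit:

```python
import torch
from scipy import stats

def post_bn_normality(model, bn_layer, x):
    """Collect the activations coming out of `bn_layer` on a batch `x`
    and return a Shapiro-Wilk p-value per unit (low p = far from Gaussian)."""
    acts = []
    handle = bn_layer.register_forward_hook(
        lambda module, inputs, output: acts.append(output.detach()))
    model(x)
    handle.remove()
    a = acts[0].reshape(acts[0].shape[0], -1)  # [batch, units]
    return [stats.shapiro(a[:, i].cpu().numpy()).pvalue for i in range(a.shape[1])]
```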
As for posting about deep learning, I was just hoping that there would be enough people here who would be interested. Looks like there might be, given that you replied. :)