I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function. The gradient in backpropagation is calculated from the chain rule, where each factor d\sigma/dz in the “chain” will always be less than one (in fact at most 1/4), and close to zero for large or small inputs. So for feed-forward networks, the problem is a little different from the recurrent networks you describe.
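As a rough numerical sketch of that chain-rule argument (a toy stack of sigmoid layers, ignoring the weight factors; the layer count and inputs below are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never exceeds 0.25

# Hypothetical pre-activations for a 20-layer stack (values are arbitrary).
rng = np.random.default_rng(0)
pre_activations = rng.normal(scale=2.0, size=20)

# Backprop picks up one d(sigma)/dz factor per layer (weights ignored here),
# so the overall product shrinks roughly geometrically with depth.
factors = sigmoid_prime(pre_activations)
print("per-layer factors between", factors.min(), "and", factors.max())
print("product over 20 layers:", np.prod(factors))
```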
The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.
A minor point: the gradient doesn’t necessarily tend towards zero as you get closer to a local minimum, that depends on the higher order derivatives. Imagine a local minimum at the bottom of a funnel or spike, for instance, or a very spiky, fractal-like landscape. On the other hand, a local minimum in a region with a small gradient is a desirable property, since it means small perturbations in the input data don’t change the output much. But this point will be difficult to reach, since learning depends on the gradient...
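A toy one-dimensional illustration of that point (hypothetical losses, just for contrast): a quadratic bowl has a gradient that shrinks near its minimum, while a spike like |x| keeps a gradient of magnitude 1 arbitrarily close to it.

```python
import numpy as np

def grad_quadratic(x):
    return 2.0 * x        # derivative of x**2: shrinks towards the minimum at 0

def grad_spike(x):
    return np.sign(x)     # (sub)derivative of |x|: magnitude 1 right up to the minimum

for x in [1.0, 0.1, 0.01, 0.001]:
    print(f"x={x:<6} |grad| quadratic={abs(grad_quadratic(x)):.4f}  spike={abs(grad_spike(x)):.1f}")
```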
(Thanks for the interesting analysis. I’m happy to discuss this, but I probably won’t drop by regularly to check comments; feel free to email me at ketil at malde point org.)
I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function.
That’s what I used to think too. :)
If you look at the post above, I even linked to the reason why I thought that. In particular, vanishing gradients was taught as intrinsically related to the sigmoid function on page 105 of these lecture notes, which is where I initially learned about the problem.
However, I no longer think gradient vanishing is fundamentally linked to sigmoids or tanh activations.
I think that there is probably some confusion in terminology, and some people use the words differently than others. If we look in the Deep Learning Book, there are two sections that talk about the problem, namely section 8.2.5 and section 10.7, neither of which brings up sigmoids as being related (though they do bring up deep weight-sharing networks). Goodfellow et al. cite Sepp Hochreiter’s 1991 thesis as the original document describing the issue, but unfortunately it’s in German, so I cannot comment on whether it links the issue to sigmoids.
Currently, when I Ctrl-F “sigmoid” on the Wikipedia page for vanishing gradients, there are no mentions. There is a single subheader which states, “Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.” However, the citation for this statement comes from this paper, which mentions vanishing gradients only once and explicitly states:
We can see the model as an exponential number of linear models that share parameters (Nair and Hinton, 2010). Because of this linearity, gradients flow well on the active paths of neurons (there is no gradient vanishing effect due to activation non-linearities of sigmoid or tanh units)
(Note: I misread the quote above—I’m still confused).
I think this is quite strong evidence that I was not taught the correct usage of vanishing gradients.
The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.
Interesting you say that. I actually wrote a post on rethinking batch normalization, and I no longer think it’s justified to say that batch normalization simply mitigates vanishing gradients. The exact way that batch normalization works is a bit different, and it would be inaccurate to describe it as an explicit strategy to reduce vanishing gradients, although it may help. (Funnily enough, the original batch normalization paper says that with batchnorm they were able to train sigmoid networks more easily.)
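For reference, a minimal sketch of the batch-norm transform itself (array shapes and variable names are placeholders, not taken from the linked post): each feature is standardized over the mini-batch and then rescaled by learned gamma and beta parameters.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for a (batch, features) array.

    Each feature is standardized over the mini-batch, then rescaled by the
    learned parameters gamma and beta (Ioffe & Szegedy, 2015).
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy usage: activations with a large offset and spread come out roughly
# zero-mean and unit-variance per feature, whatever the preceding layers did.
x = np.random.default_rng(0).normal(loc=5.0, scale=10.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```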
A minor point: the gradient doesn’t necessarily tend towards zero as you get closer to a local minimum, that depends on the higher order derivatives.
True. I had a sort of smooth loss function in my head.
I think this is quite strong evidence that I was not taught the correct usage of vanishing gradients.
I’m very confused. The way I’m reading the quote you provided, it says ReLU works better because it doesn’t have the gradient vanishing effect that sigmoid and tanh have.
Interesting. I just re-read it and you are completely right. Well, I wonder how that interacts with what I said above.