That proof of the instability of RNNs is very nice.
The version of the vanishing gradient problem I learned is simply that if you’re updating weights in proportion to the gradient, and your average weight somehow ends up around 0.98, then as you increase the number of layers n, your gradient, and therefore your update size, shrinks roughly like (0.98)^n, which is not the behavior you want.
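To make that concrete, here’s a back-of-the-envelope sketch (my own toy arithmetic, not anyone’s actual training code):

```python
# Backpropagating through n layers whose "average" multiplicative factor is 0.98
# scales the gradient by roughly 0.98**n, so updates to the earliest layers become tiny.
for n in (10, 100, 500, 1000):
    print(f"n = {n:4d}: gradient scale ~ {0.98 ** n:.2e}")
# n =   10: gradient scale ~ 8.17e-01
# n =  100: gradient scale ~ 1.33e-01
# n =  500: gradient scale ~ 4.10e-05
# n = 1000: gradient scale ~ 1.68e-09
```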
Great, thanks. It is adapted from Goodfellow et al.’s discussion of the topic, which I cite in the post.
That makes sense. However, Goodfellow et al. argue that this isn’t a big issue for non-RNNs. Their discussion is a bit confusing to me, so I’ll just leave it below:
This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight w by itself many times. The product w^t will either vanish or explode depending on the magnitude of w. However, if we make a non-recurrent network that has a different weight w^(t) at each time step, the situation is different. If the initial state is given by 1, then the state at time t is given by ∏_t w^(t). Suppose that the w^(t) values are generated randomly, independently from one another, with zero mean and variance v. The variance of the product is O(v^n). To obtain some desired variance v* we may choose the individual weights with variance v = (v*)^(1/n). Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).
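If I’m reading them right, the contrast is between one shared weight w being reused (the recurrent case), where w^t must vanish or explode, and independent per-step weights w^(t), where you can pick the per-step variance to hit any target. Here’s a rough numerical sketch of that claim (my own toy check with numpy, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Recurrent case: one shared scalar weight w, so the state after t steps is w**t,
# which vanishes or explodes depending on whether |w| is below or above 1.
for w in (0.9, 1.1):
    print(f"shared w = {w}: w**100 = {w ** 100:.2e}")

# Non-recurrent case: independent zero-mean weights w^(t), each with variance
# v = (v*)**(1/n), so the variance of their product is v**n = v*.
n, v_star, trials = 6, 0.5, 500_000   # small n keeps the Monte Carlo estimate stable
v = v_star ** (1.0 / n)
products = rng.normal(0.0, np.sqrt(v), size=(trials, n)).prod(axis=1)
print(f"per-step variance v = {v:.4f}, analytic product variance = {v ** n:.3f}")
print(f"empirical product variance ~ {products.var():.3f}  (target {v_star})")
```

(The product of many independent factors is very heavy-tailed, so the empirical estimate gets noisy for large n; that’s why I kept n small here.)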