But if, for any parameter, there’s some probability p that it’s an inconvenient valley instead of a convenient hill, in order to get stuck you need to have ten thousand valleys.
I don’t understand this part. Is this the probability of a given point in parameter space being an optimum (min or max)? Is this the probability of a point, given that it is a critical or near-critical point, being an optimum?
Instead of hills or valleys, it seems like the common argument is in favor of most critical points in deep neural networks being saddle points, and a fair amount of analysis has gone into what to do about that.
This paper argues that the issue is saddle points (https://arxiv.org/pdf/1406.2572.pdf), but given that it’s been three years and those methods haven’t been widely adopted, I don’t think it’s really that much of an issue.
Most modern gradient descent techniques (e.g. momentum) tend to steamroll over a lot of these problems. Distill has a beautiful set of interactive explanations of how and why momentum affects the gradient descent process here: https://distill.pub/2017/momentum/. I’d highly recommend checking it out.
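As a toy illustration of what momentum does (the quadratic, learning rate, and decay factor below are all made up for the example, not taken from the Distill article), here’s a minimal sketch of classical heavy-ball momentum on a badly conditioned bowl:

```python
import numpy as np

def momentum_gd(grad, x0, lr=0.005, beta=0.9, steps=2000):
    """Classical (heavy-ball) momentum: the velocity v accumulates a
    decaying sum of past gradients, which damps oscillation along steep
    directions and keeps progress going along shallow ones."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # decay old velocity, add the new gradient
        x = x + v                    # step along the accumulated velocity
    return x

# Badly conditioned quadratic f(x) = 0.5 * (x[0]**2 + 100 * x[1]**2),
# so the gradient is simply diag(1, 100) @ x.
grad = lambda x: np.array([1.0, 100.0]) * x
x_min = momentum_gd(grad, [1.0, 1.0])  # converges toward the minimum at the origin
```

The velocity term is exactly the "decaying sum of past gradients" the Distill article visualizes; with plain gradient descent the same learning rate would oscillate badly along the steep coordinate.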
Additionally, for many deep neural network problems we explicitly don’t want the global optimum! It usually corresponds to dramatically overfitting the training distribution/dataset.
Instead of hills or valleys, it seems like the common argument is in favor of most critical points in deep neural networks being saddle points
I agree; the point of the digression is that a saddle point is a hill in one direction and a valley in the other.
The point is that, because it’s a hill in at least one direction, a small perturbation (like the change in your estimate of the cost function from one mini-batch to the next) gets you out of it, so it’s not a problem.
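A quick sketch of that intuition (the function f(x, y) = x² − y², the noise scale, and the step sizes here are all invented for illustration): starting exactly on the saddle, noiseless gradient descent never moves, while mini-batch-style jitter rolls the iterate down the hill direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x, y) = x**2 - y**2: a valley along x, a hill along y, saddle at the origin.
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])

def descend(p, lr=0.1, steps=200, noise=0.0):
    for _ in range(steps):
        g = grad(p) + noise * rng.standard_normal(2)  # mini-batch-style jitter
        p = p - lr * g
    return p

exact = descend(np.zeros(2))               # gradient is exactly zero: stuck forever
noisy = descend(np.zeros(2), noise=1e-3)   # tiny jitter escapes along the hill (y) axis
```

The hill direction amplifies any perturbation multiplicatively each step, which is why even tiny mini-batch noise is enough to leave the saddle.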
Is this the probability of a point, given that it is a critical or near-critical point, being an optimum?
There, p is the probability that, at a given near-critical point, a given direction is valley-like rather than hill-like. If any of the directions are hill-like you can roll down along them, so to get stuck you need the critical point to be valley-like in every direction. It’s a stupid computation that isn’t actually well defined (the probability I’m estimating is dumb, and I’m only considering one critical point when I should be asking how many points are “near critical” and factoring that in, among other things), so don’t worry too much about it!
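For what it’s worth, the arithmetic behind the "ten thousand valleys" line is just p^N, assuming (dubiously, as noted above) that the N directions are independent:

```python
# If each of N directions is independently valley-like with probability p,
# a critical point traps you only when all N directions are valley-like.
N = 10_000  # order of magnitude of parameters, as in the quoted passage

for p in (0.5, 0.9, 0.999):
    print(f"p = {p}: P(all {N} directions valley-like) = {p**N:.3e}")
```

Even with p = 0.999, the chance that every one of ten thousand directions is valley-like is about e^(-10) ≈ 5e-5; for p = 0.9 or 0.5 it underflows to zero in double precision.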