Instead of hills or valleys, the common argument seems to be that most critical points in deep neural networks are saddle points.
I agree; the point of the digression is that a saddle point is a hill in one direction and a valley in another.
The point is that because it’s a hill in at least one direction, a small perturbation (like the change in your estimate of the cost function from one mini-batch to the next) gets you out of it, so it’s not a problem.
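To make that concrete, here’s a minimal sketch (mine, not from the post; the toy function, learning rate, and noise scale are all made up for illustration) of gradient descent with a little noise escaping a saddle:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin:
# a valley along x, a hill along y.
def grad(w):
    x, y = w
    return np.array([2.0 * x, -2.0 * y])

rng = np.random.default_rng(0)
w = np.zeros(2)        # start exactly on the saddle, where the gradient is zero
lr, noise = 0.1, 1e-3  # the noise term stands in for mini-batch gradient noise

for _ in range(200):
    g = grad(w) + noise * rng.standard_normal(2)
    w -= lr * g

print(w)  # x stays near 0, but |y| has blown up: the noise pushed us off the saddle
```

Along the hill-like direction the gradient step multiplies y by a factor greater than 1 each iteration, so even tiny noise gets amplified exponentially and the iterate rolls off the saddle.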
Is this the probability that a point, given that it is critical or near-critical, is an optimum?
There, p is the probability that, given a near-critical point, a given direction at it is valley-like rather than hill-like. If any of the directions are hill-like you can roll down those directions, so you need your critical points to be valley-like in every direction. It’s a stupid computation that isn’t actually well defined (the probability I’m estimating is dumb, and I’m only considering one critical point when I should be asking how many points are “near critical” and factoring that in, among other things), so don’t worry too much about it!
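For what it’s worth, the flavor of that back-of-envelope computation (under the crude, admittedly not-well-defined assumption that each of the $n$ directions is independently valley-like with probability $p$) is:

$$ P(\text{all } n \text{ directions valley-like}) = p^n, \qquad \text{e.g. } p = \tfrac{1}{2},\; n = 1000 \implies 2^{-1000} \approx 10^{-301}, $$

so in high dimensions essentially no critical point is valley-like in every direction; almost all of them are saddles.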