Hm. Thinking of this in terms of the few relevant projects I’ve worked on, problems with (nominally) 10,000 parameters definitely had plenty of local minima. In retrospect it’s easy to see how. Saddles could be arbitrarily long, with many parameters becoming basically irrelevant depending on where you were standing, and the only way out was effectively restarting. More generally, the parameters were very far from independent. Besides the saddles, for example, you had rough clusters of parameters where, in most situations, you’d want all or none of them, but not half, to be (say) small. In other words, the problem wasn’t “really” 10,000-dimensional; we just didn’t know how or where to reduce the dimensionality. I wonder how common that is.
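To make the “clusters” point concrete, here’s a hypothetical toy (not from the projects above): a few clusters of coupled parameters, each parameter in a double well at ±1, a coupling term that makes parameters within a cluster want to agree, and a small tilt so the −1 well is slightly better. Every “all +1” cluster then sits in a strictly local (non-global) minimum, and random restarts land in different basins. All names and constants here are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

m, k = 5, 4            # 5 clusters of 4 parameters each (20 parameters total)
lam, tilt = 1.0, 0.1   # coupling strength, asymmetry between the two wells

def loss(x):
    x = x.reshape(m, k)
    wells = np.sum((x**2 - 1.0)**2)                                # each param wants +/-1
    agree = lam * np.sum((x - x.mean(axis=1, keepdims=True))**2)   # agree within a cluster
    return wells + agree + tilt * np.sum(x)                        # tilt makes -1 the better well

rng = np.random.default_rng(0)
basins = set()
for _ in range(20):
    res = minimize(loss, rng.normal(size=m * k))                   # plain local optimizer (BFGS)
    basins.add(tuple(np.sign(res.x.reshape(m, k).mean(axis=1)).astype(int)))
print(f"{len(basins)} distinct basins (cluster sign patterns) out of 20 restarts")
```

The point is just that the effective degrees of freedom are the m cluster signs, not the m·k raw parameters, yet a local optimizer over the raw parameters gets stuck in whichever sign pattern it started near.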
Two more thoughts: the above is probably more common in [what I intuitively think of as] “physical” problems where the parameters have some sort of geometric or causal relationship, which is maybe less meaningful for neural networks?
Also, for optimization more broadly, constraints give you a way to wind up with many parameters that can’t be changed to decrease your function without leaving the feasible region, and that doesn’t require a massive coincidence: along the pinned directions the gradient just has to point outward. (The boundary of the feasible region is lower-dimensional.) Again, I guess not something deep learning has to worry about in full generality.
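A minimal sketch of that constrained case, again with made-up names and numbers: minimize a smooth, mostly linear objective over a box. At the solution, nearly every coordinate sits exactly on a bound with a nonzero gradient component pointing out of the box, so it can’t be moved to decrease the objective even though nothing about the curvature is special.

```python
import numpy as np
from scipy.optimize import minimize

n = 50
rng = np.random.default_rng(1)
c = rng.normal(size=n)                        # random linear pull on each coordinate

def f(x):
    return c @ x + 0.01 * np.sum(x**2)        # mostly linear, slightly curved

res = minimize(f, np.zeros(n), method="L-BFGS-B", bounds=[(-1.0, 1.0)] * n)
at_bound = np.isclose(np.abs(res.x), 1.0)
print(f"{at_bound.sum()} of {n} parameters ended up pinned at a bound")
```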