This argument doesn’t seem to take into account selection bias.
We don’t get into a local optimum because we picked a random point and wow, it’s a local optimum, what are the odds!?
We get into a local optimum because we used an algorithm that specifically *finds* local optima. If they’re still there in higher dimensions then we’re still liable to fall into them rather than into the global optimum.
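(To make the selection-bias point concrete, here’s a toy gradient descent run on a made-up one-dimensional loss with two basins; the function, step size, and starting points are arbitrary choices. Which optimum you land in is determined entirely by which side of the ridge you start on, not by chance.)

```python
# Toy 1-D loss with two basins: the global minimum sits near x ~ -0.77
# and a worse local minimum sits near x ~ +0.62.
def loss(x):
    return x**4 - x**2 + 0.3 * x

def grad(x):
    return 4 * x**3 - 2 * x + 0.3

def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent: it converges to whichever local optimum's
    # basin the starting point happens to lie in.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

for start in (-1.5, -0.2, 0.5, 1.5):
    x = descend(start)
    print(f"start {start:+.2f} -> x = {x:+.3f}, loss = {loss(x):+.4f}")
```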
The point is that they really are NOT still there in higher dimensions.
Oh, well in that case the point isn’t subtly lacking, it’s just easily disproven. Given any function from I^N to R, I can take the tensor product with cos(k pi x) and get a new function from I^{N+1} to R which has k times as many non-globally-optimal local optima. Pick a decent k and iterate, and you can see the number growing exponentially with dimension, not approaching 0.
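(A quick numerical sketch of that construction, with an arbitrary base function, k = 5, and a crude grid-based count of local minima; it only checks axis-aligned neighbours and only counts interior minima, so don’t read the exact ratio too literally. The point is just that the count jumps when you take the product, rather than shrinking.)

```python
import numpy as np

# Base function on I^1 = [0, 1]; the 0.5*x tilt makes its local minima strict.
def f(x):
    return np.cos(3 * np.pi * x) + 0.5 * x

# Product construction: g(x, y) = f(x) * cos(k * pi * y) on I^2.
k = 5
def g(x, y):
    return f(x) * np.cos(k * np.pi * y)

def count_grid_local_minima(v):
    """Count interior grid points strictly below all axis-aligned neighbours."""
    if v.ndim == 1:
        core = v[1:-1]
        return int(((core < v[:-2]) & (core < v[2:])).sum())
    core = v[1:-1, 1:-1]
    mask = (
        (core < v[:-2, 1:-1]) & (core < v[2:, 1:-1]) &
        (core < v[1:-1, :-2]) & (core < v[1:-1, 2:])
    )
    return int(mask.sum())

xs = np.linspace(0.0, 1.0, 401)
X, Y = np.meshgrid(xs, xs, indexing="ij")
print("grid local minima of f on I^1:", count_grid_local_minima(f(xs)))
print("grid local minima of g on I^2:", count_grid_local_minima(g(X, Y)))
```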
Perhaps there’s something special about the functions we try to optimize in deep learning, a property that rules out such cases? That could be. But you’ve said nothing special about (or even defined) a particular class of deep learning problems; rather, you’ve made a claim about all higher-dimensional optimization problems, a claim with an infinite number of counterexamples.
I definitely intended the implied context to be ‘problems people actually use deep learning for,’ which does impose constraints that I think are sufficient.
Certainly the claim I’m making isn’t true of literally all functions on high-dimensional spaces. And if I actually cared about all functions, or even all continuous functions, on these spaces, then I believe there are no-free-lunch theorems that prevent machine learning from being effective at all (e.g. what about those functions that have a vast number of huge oscillations right between those two points you just measured?!)
But in practice deep learning is applied to problems that humans care about; computer vision and robotics control problems, for example, are very widely used. In these problems there are distributions of functions that empirically exist, and a simple model of those types of problems is that they can be locally approximated by Taylor series over a region of positive size around any point of the domain that you care about, but that these local regions are stitched together essentially at random.
In that context, it makes sense that the directional second derivatives of the function at a critical point would be roughly independent of one another, and only rarely would they all line up as positive at once, which is what an actual local minimum requires; most critical points should instead be saddles.
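(A toy Monte Carlo of that claim, under the simplest possible assumption: model the N directional second derivatives at a critical point as independent draws symmetric about zero, and check how often all of them come up positive, which is what a local minimum would need. Under that assumption the fraction is just 2^-N, so it dies off fast with dimension.)

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000

# For each trial, draw N independent "directional second derivatives"
# and call the critical point a minimum only if every one is positive.
for n in (1, 2, 5, 10, 15, 20):
    second_derivs = rng.standard_normal((trials, n))
    frac_minima = np.mean(np.all(second_derivs > 0, axis=1))
    print(f"N = {n:2d}: fraction that are minima ~ {frac_minima:.6f}"
          f"  (2^-N = {2.0 ** -n:.6f})")
```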
Beyond that, I’d expect that if you impose a measure on the space of such functions in some way (say, by limiting the number of patches and the growth rate of the power-series coefficients), the density of functions with even one critical point would quickly approach zero, even while infinitely many such functions exist in an absolute sense.
I got a little defensive thinking about this, since I felt like the context of ‘deep learning as it is practiced in real life’ was clear; but looking back at the original post, maybe it wasn’t outlined that way. Even so, your reply feels disingenuous, because to suggest that functions with many local optima are “common” you’re explicitly constructing adversarial examples rather than sampling functions from some space. If I start suggesting that deep learning is robust to adversarial examples, I have much deeper problems.