It may be the deepest thing we understand about NNs (though I might get stoned for suggesting we actually know the answer). See lalaithion's link for one way to see it. My own take is as follows:
First, consider how many n-spheres of radius slightly below 1/2 you can pack into an n-dimensional unit cube. When n is low, "one" is the obvious answer. When n is high enough, the true answer is more than one. You can find demonstrations online, and if you're like me you'll need some time to accept this strange result. But once you do, you'll realize that high-dimensional means damn big, and that's the key insight.
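If you don't want to hunt for the demonstration online, here is a quick back-of-the-envelope check (my own construction, with an arbitrary radius of 0.49, not something from the original argument): the centers of disjoint balls of radius r must sit in an inner cube of side 1 − 2r and at pairwise distance at least 2r, and the diagonal of that inner cube grows like sqrt(n), so eventually a second ball fits.

```python
import math

r = 0.49      # ball radius, slightly below 1/2
side = 1.0    # side of the unit cube

# A ball of radius r must have its center inside the inner cube
# [r, 1-r]^n, whose side is (side - 2r). Two balls are disjoint iff
# their centers are at distance >= 2r. Putting centers at opposite
# corners of the inner cube gives distance (side - 2r) * sqrt(n),
# which grows without bound as n grows.
for n in (1, 3, 100, 1000, 2401, 10000):
    max_center_dist = (side - 2 * r) * math.sqrt(n)
    print(f"n = {n:5d}: farthest two centers can be = {max_center_dist:.3f}, "
          f"second disjoint ball fits: {max_center_dist >= 2 * r}")

# Smallest n for which this corner construction yields a second ball.
print("first such n:", math.ceil((2 * r / (side - 2 * r)) ** 2))
```

With r = 0.49 the corner construction first works around n = 2401; push r closer to 1/2 and the required dimension grows, but it always exists.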
Second, consider that training amounts to looking for an n-dimensional point (one dimension for each weight) in a normalized unit cube. OK, now you've got it: gradient descent (kind of) always works in high dimensions because high dimensions mean a damn big number of possible directions and quasi-solutions, so many that by a pigeonhole-style argument you can't really get the dead ends or swamp traps you see in low dimensions: from almost any point, some direction still leads downhill.
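Here is one hedged toy model of that intuition (mine, not from the linked post, and it assumes the Hessian at a random critical point behaves like a random symmetric matrix, which real loss landscapes need not satisfy): as the number of dimensions grows, the chance that every direction curves upward, i.e. that you are in a genuine dead end rather than a saddle with an escape route, collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_dead_end(n, trials=2000):
    """Toy model: treat the Hessian at a random critical point as a random
    symmetric matrix. The point is a dead end (a true local minimum) only
    if every eigenvalue is positive, i.e. no direction points downhill."""
    dead_ends = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / 2.0                    # random symmetric "Hessian"
        if np.all(np.linalg.eigvalsh(h) > 0):  # all directions go uphill
            dead_ends += 1
    return dead_ends / trials

for n in (1, 2, 3, 5, 8, 12):
    print(f"n = {n:2d}: P(random critical point is a dead end) ~ "
          f"{prob_dead_end(n):.4f}")
```

Already at a dozen dimensions the estimate is essentially zero; with millions of weights, "stuck with nowhere to go" is the exception, not the rule, at least in this toy picture.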
Third, you realize that this is all wrong and you were right from the start: what we thought were solutions frequently exhibit bizarre statistical properties (think adversarial examples; a toy illustration follows below), and you need to rethink what generalization means. But that's for another ref.
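To make "bizarre statistical properties" concrete, the sketch below runs the standard fast-gradient-sign trick (Goodfellow et al.'s FGSM, which isn't named above, applied to a toy logistic model of my own choosing rather than a real network): in high dimensions, moving every coordinate by a tiny amount is enough to flip a confident prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a trained classifier: logistic regression in d dimensions
# with random weights (an assumption for illustration, not a trained network).
d = 10_000
w = rng.standard_normal(d) / np.sqrt(d)

def prob_class_1(x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))

x = 2.0 * w   # an input the model classifies confidently as class 1
print("clean confidence:         ", round(float(prob_class_1(x)), 3))

# FGSM-style perturbation: move each coordinate by only eps in the direction
# that increases the loss for the true label. Per coordinate the change is
# tiny, but summed over d coordinates the logit shifts by about eps * ||w||_1,
# which grows with dimension.
eps = 0.05
x_adv = x - eps * np.sign(w)
print("adversarial confidence:   ", round(float(prob_class_1(x_adv)), 3))
print("max per-coordinate change:", float(np.max(np.abs(x_adv - x))))
```

The confident class-1 prediction drops to the other side while no coordinate moved by more than 0.05, which is exactly the high-dimensional flavor of the problem.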
https://dl.acm.org/doi/abs/10.1145/3446776