Pseudorandom warp fields
[tl;dr the loss landscape around a set of weights encoding an unlearnable ‘pseudorandom’ function will be warped in such a way that gradient optimizers will bob around for exponentially long. ]
Unlearnable Functions: Sample Complexity and Time Complexity
Computational learning theory contains numerous ‘no-go’ results indicating that many functions are not tractably learnable.
The most classical results are probably VC dimension and PAC learnability. A good example to think about is the class of parity functions. The output is, in some sense, very (maximally) sensitive to the input*, which effectively means we need a lot of data points to pin down which function we are looking at: the sample complexity is high.
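To make ‘very sensitive’ concrete, here is a minimal sketch (the bit-width and names are my own illustrative choices): flipping any single input bit flips the parity output, so no individual coordinate tells you much about the label on its own.

```python
import random

def parity(bits):
    """Parity of a 0/1 bit list: 1 if the number of ones is odd, else 0."""
    return sum(bits) % 2

random.seed(0)
x = [random.randint(0, 1) for _ in range(16)]   # a random 16-bit input

# Maximal sensitivity: flipping any single bit always flips the output.
for i in range(len(x)):
    x_flipped = list(x)
    x_flipped[i] ^= 1
    assert parity(x_flipped) != parity(x)
```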
Using cryptographic pseudorandom generators (e.g., Blum–Blum–Shub), one can similarly construct functions that need a superpolynomial amount of time to learn, under standard cryptographic hardness assumptions.
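For concreteness, here is a minimal sketch of the Blum–Blum–Shub generator (toy parameters of my own choosing; the hardness argument only applies with large random primes, since it rests on the difficulty of factoring $N$):

```python
def bbs_bits(seed: int, n_bits: int, p: int = 11, q: int = 23) -> list[int]:
    """Blum-Blum-Shub: iterate x_{i+1} = x_i^2 mod N with N = p*q, where p and q
    are primes congruent to 3 mod 4, and output the least significant bit of each
    state. Toy parameters; real use needs large primes and a seed coprime to N."""
    N = p * q
    x = seed % N
    bits = []
    for _ in range(n_bits):
        x = (x * x) % N
        bits.append(x & 1)
    return bits

print(bbs_bits(seed=3, n_bits=16))  # 16 pseudorandom bits from the toy generator
```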
Neural networks are very expressive. The oft-cited universal approximation theorem states that any continuous function on a compact domain can be approximated arbitrarily well by a neural network (but be careful with the required width and depth). These superpolynomially-hard functions can therefore be expressed by neural networks, yet they wouldn't be learnable: training would face both the superpolynomial time requirement and the optimization challenges discussed below.
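As a small sanity check on the expressivity half of this claim (my own construction, not taken from the hardness literature): a single hidden layer of $n$ ReLU units suffices to compute $n$-bit parity exactly, by building a triangular wave in $s = \sum_i x_i$.

```python
import torch
import torch.nn as nn

def exact_parity_net(n_bits: int) -> nn.Sequential:
    """Width-n ReLU network computing n-bit parity exactly:
    g(s) = ReLU(s) - 2*ReLU(s-1) + 2*ReLU(s-2) - ... equals s mod 2
    for integer s = (number of ones in the input)."""
    lin1 = nn.Linear(n_bits, n_bits)
    lin2 = nn.Linear(n_bits, 1)
    with torch.no_grad():
        lin1.weight.copy_(torch.ones(n_bits, n_bits))                # every hidden unit sees s
        lin1.bias.copy_(-torch.arange(n_bits, dtype=torch.float32))  # hidden unit k computes ReLU(s - k)
        coeffs = torch.full((n_bits,), 2.0)
        coeffs[0] = 1.0
        coeffs[1::2] *= -1                                           # output weights +1, -2, +2, -2, ...
        lin2.weight.copy_(coeffs.unsqueeze(0))
        lin2.bias.zero_()
    return nn.Sequential(lin1, nn.ReLU(), lin2)

net = exact_parity_net(8)
X = torch.randint(0, 2, (1000, 8)).float()
assert torch.allclose(net(X).squeeze(1), X.sum(dim=1) % 2)  # exact on {0,1}^8
```

Expressing the function by hand-setting weights is easy; the point is that finding such weights by gradient descent from a generic initialization is the hard part.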
Gradient Optimizers and Cursed Landscapes
There is some old work by Stuart Kauffman (the NK fitness-landscape model) investigating what kinds of landscapes are navigable by optimization processes like evolution by natural selection. The takeaway is that ‘most’ rugged fitness landscapes aren’t navigable by local optimizers akin to gradient descent; there are too many local optima, making efficient search infeasible.
I conjecture that the loss landscape around a set of weights $w \in W$ encoding a pseudorandom-hard function $f$ becomes extremely oscillatory**. This warping causes any gradient optimizer (regardless of step size) to get stuck in local minima or wander around for exponentially long periods***.
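A hypothetical (untested) sketch of the kind of experiment that would probe this ‘bobbing around’ behaviour, using full parity as a stand-in for a genuinely pseudorandom target: train a small MLP with SGD and watch whether held-out accuracy ever leaves chance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_bits = 24  # illustrative choice

def parity_batch(n: int):
    X = torch.randint(0, 2, (n, n_bits)).float()
    return X, (X.sum(dim=1) % 2).long()

X_train, y_train = parity_batch(8192)
X_test, y_test = parity_batch(2048)

model = nn.Sequential(nn.Linear(n_bits, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(20_000):
    idx = torch.randint(0, len(X_train), (128,))        # random minibatch
    loss = loss_fn(model(X_train[idx]), y_train[idx])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 2000 == 0:
        with torch.no_grad():
            acc = (model(X_test).argmax(dim=1) == y_test).float().mean().item()
        print(f"step {step:6d}  train loss {loss.item():.3f}  test acc {acc:.3f}")
        # expectation (hedged): test acc hovers near 0.5 for moderate n_bits
```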
This has practical implications for reasoning about the inductive bias of stochastic gradient descent (SGD). Most existing convergence proofs for gradient optimizers impose conditions such as Lipschitz continuity on the landscape, or reason about ‘long tails of $\epsilon$-approximate saddle points’: conditions a pseudorandomly warped landscape would presumably violate.
*This is the start of the beautiful field of Fourier analysis of Boolean functions. Thanks to Jake and Kaarel for introducing me to this circle of ideas recently—it’s very cool!
**Here’s something I am confused about: the Singular Learning Theory (SLT) point of view would be that parameters encoding these functions have large $\lambda$ (e.g., as reflected in the eigenvalues of the Hessian). What confuses me is that $\lambda$ is bounded by $\frac{d}{2}$, which can be large but not exponentially large. Perhaps this is where the mysterious multiplicity parameter $m$ comes in, or lower-order contributions to the free-energy formula (written out below).
***Perhaps ‘superpolynomially long’ is a more accurate description.
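For reference, the free-energy formula mentioned in footnote ** is Watanabe's asymptotic expansion, with $L_n$ the empirical loss, $w_0$ an optimal parameter, $\lambda$ the learning coefficient (RLCT), and $m$ its multiplicity:

$$F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1).$$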