Sorry, I realized that you’re mostly talking about the space of true distributions and I was mainly talking about the “data manifold” (related to the structure of the map x↦p(x∣w∗) for fixed w∗). You can disregard most of that.
Though, even in the case where we’re talking about the space of true distributions, I’m still not convinced that the image of W under p(x∣w) needs to be fractal. Like, a space-filling assumption sounds to me like basically a universal approximation argument—you’re assuming that the image of W densely (or almost densely) fills the space of all probability distributions of a given dimension. But of course we know that universal approximation is problematic and can’t explain what neural nets are actually doing for realistic data.
Obviously this is all speculation but maybe I’m saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)?
Curious what’s your beef with universal approximation? Stone-weierstrass isn’t quantitative—is that the reason?
If true it suggest the fractal dimension (probably related to the information dimension I linked to above) may be important.
Obviously this is all speculation but maybe I’m saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)?
Oh I actually don’t think this is speculation, if (big if) you satisfy the conditions for universal approximation then this is just true (specifically that the image of W is dense in function space). Like, for example, you can state Stone-Weierstrass as: for a Hausdorff space X, and the continuous functions under the sup norm C(X,R), the Banach subalgebra of polynomials is dense in C(X,R). In practice you’d only have a finite-dimensional subset of the polynomials, so this obviously can’t hold exactly, but as you increase the size of the polynomials, they’ll be more space-filling and the error bound will decrease.
Curious what’s your beef with universal approximation? Stone-weierstrass isn’t quantitative—is that the reason?
The problem is that the dimension of W required to achieve a given ϵ error bound grows exponentially with the dimension d of your underlying space X. For instance, if you assume that weights depend continuously on the target function, ϵ-approximating all Cn functions on [0,1]d with Sobolev norm ≤1 provably takes at least O(ϵ−d/n) parameters (DeVore et al.). This is a lower bound.
So for any realistic d universal approximation is basically useless—the number of parameters required is enormous. Which makes sense because approximation by basis functions is basically the continuous version of a lookup table.
Because neural networks actually work in practice, without requiring exponentially many parameters, this also tells you that the space of realistic target functions can’t just be some generic function space (even with smoothness conditions), it has to have some non-generic properties to escape the lower bound.
Obviously this is all speculation but maybe I’m saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)?
Stone-weierstrass isn’t quantitative. If true it suggest the fractal dimension (probably related to the information dimension I linked to above) may be important.
Sorry, I realized that you’re mostly talking about the space of true distributions and I was mainly talking about the “data manifold” (related to the structure of the map x↦p(x∣w∗) for fixed w∗). You can disregard most of that.
Though, even in the case where we’re talking about the space of true distributions, I’m still not convinced that the image of W under p(x∣w) needs to be fractal. Like, a space-filling assumption sounds to me like basically a universal approximation argument—you’re assuming that the image of W densely (or almost densely) fills the space of all probability distributions of a given dimension. But of course we know that universal approximation is problematic and can’t explain what neural nets are actually doing for realistic data.
Obviously this is all speculation but maybe I’m saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)?
Curious what’s your beef with universal approximation? Stone-weierstrass isn’t quantitative—is that the reason?
If true it suggest the fractal dimension (probably related to the information dimension I linked to above) may be important.
Oh I actually don’t think this is speculation, if (big if) you satisfy the conditions for universal approximation then this is just true (specifically that the image of W is dense in function space). Like, for example, you can state Stone-Weierstrass as: for a Hausdorff space X, and the continuous functions under the sup norm C(X,R), the Banach subalgebra of polynomials is dense in C(X,R). In practice you’d only have a finite-dimensional subset of the polynomials, so this obviously can’t hold exactly, but as you increase the size of the polynomials, they’ll be more space-filling and the error bound will decrease.
The problem is that the dimension of W required to achieve a given ϵ error bound grows exponentially with the dimension d of your underlying space X. For instance, if you assume that weights depend continuously on the target function, ϵ-approximating all Cn functions on [0,1]d with Sobolev norm ≤1 provably takes at least O(ϵ−d/n) parameters (DeVore et al.). This is a lower bound.
So for any realistic d universal approximation is basically useless—the number of parameters required is enormous. Which makes sense because approximation by basis functions is basically the continuous version of a lookup table.
Because neural networks actually work in practice, without requiring exponentially many parameters, this also tells you that the space of realistic target functions can’t just be some generic function space (even with smoothness conditions), it has to have some non-generic properties to escape the lower bound.
Ooooo okay so this seems like it’s directly pointing to the fractal story! Exciting!
Obviously this is all speculation but maybe I’m saying that the universal approximation theorem implies that neural architectures are fractal in space of all distributtions (or some restricted subset thereof)?
Stone-weierstrass isn’t quantitative. If true it suggest the fractal dimension (probably related to the information dimension I linked to above) may be important.