Fractal Fuzz: making up for size
GPT-3 recognizes about 50k possible tokens. For a 1000-token context window that means there are (5⋅10^4)^1000 ≈ 10^4700 possible prompts. Astronomically large. If we assume the output of a single run of GPT is 200 tokens, then for each possible prompt there are (5⋅10^4)^200 ≈ 10^940 possible continuations.
GPT-3 is probabilistic: for each of the ≈ 10^4700 possible prompts x it defines a distribution q(x) on a set of size ≈ 10^940, in other words a point in a (10^940 − 1)-dimensional space. [1]
Mind-bogglingly large. Compared to these numbers, the amount of training data (roughly 300 billion tokens?) and the size of the model (175 billion parameters) seem absolutely puny.
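As a quick sanity check on these back-of-envelope counts, a short sketch (the ~50k vocabulary, 1000-token prompt and 200-token continuation are just the assumptions above):

```python
# Back-of-envelope arithmetic for the counts quoted above.
from math import log10

vocab = 5e4                              # assumed vocabulary size (~50k tokens)
log_prompts = 1000 * log10(vocab)        # ~4699, i.e. ~10^4700 possible prompts
log_continuations = 200 * log10(vocab)   # ~940,  i.e. ~10^940 possible continuations

print(f"possible prompts       ~ 10^{log_prompts:.0f}")
print(f"possible continuations ~ 10^{log_continuations:.0f}")
# each prompt's output distribution is a point in a (10^940 - 1)-dimensional simplex
```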
I won’t be talking about the data or ‘overparameterization’ in this short post; that is well explained by Singular Learning Theory. Instead, I will be talking about nonrealizability.
Nonrealizability & the structure of natural data
Recall the setup of (parametric) Bayesian learning: there is a sample space Ω, a true distribution q(x) on Ω, and a parameterized family of probability distributions p(x|w), w ∈ W ⊂ R^d.
It is often assumed that the true distribution is ‘realizable’, i.e. q(x) = p(x|w_0) for some w_0 ∈ W. Given the numbers in the previous section this assumption seems dubious, but the situation becomes significantly easier to analyze, both conceptually and mathematically, when we assume realizability.
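To make nonrealizability concrete, here is a minimal toy (my own illustration, nothing GPT-specific): a one-parameter family on three outcomes whose curve in the 2-simplex misses a given true distribution, so the best achievable KL divergence is strictly positive.

```python
# Toy nonrealizability: p(.|w) traces a curve inside the 2-simplex, and the true
# distribution q lies off that curve, so min_w KL(q || p(.|w)) > 0.
import numpy as np

def p(w):
    z = np.array([1.0, np.exp(w), np.exp(2.0 * w)])  # hypothetical one-parameter model
    return z / z.sum()

q = np.array([0.7, 0.1, 0.2])                        # true distribution, not on the curve

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

ws = np.linspace(-5.0, 5.0, 10001)                   # brute-force search over w
losses = [kl(q, p(w)) for w in ws]
print(f"min_w KL(q || p(.|w)) = {min(losses):.4f}")  # strictly positive: q is not realizable
```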
Conceptually, if the space of possible true distributions is very large compared to the space of model parameters, we may ask: how do we know that the true distribution is in the model (or can be well-approximated by it)?
One answer one often hears is the ‘universal approximation theorem’ (essentially the Stone-Weierstrass theorem). I’ll come back to this shortly.
Another point of view is that real data sets are actually localized in a very low-dimensional subset of all possible data.[2] Following this road leads to theories of lossless compression, cf. sparse coding and compressed sensing, which are of obvious importance to interpreting modern neural networks.
That is lossless compression; the other side of the coin is lossy compression.
Fractals and lossy compression
GPT-3 has 175 billion parameters, but the space of possible continuations is vastly larger (≫ 10^900 dimensions per prompt). Even if we grant that sparse coding implies the effective dimensionality is much smaller, is it really small enough?
Whenever we have a lower-dimensional subspace W of a higher-dimensional space, there are points y in the larger space that are very (even arbitrarily) far from W. This is easy to see in the linear case, but it is also true if W is more like a manifold[3]: the volume of a lower-dimensional space is vanishingly small compared to that of the higher-dimensional space. It’s a simple mathematical fact that can’t be denied!
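A quick Monte Carlo sketch of that volume fact, with a linear W for simplicity (the dimensions below are arbitrary illustrative choices): a typical point of the unit cube in R^D sits at distance roughly sqrt((D − d)/3) from a fixed d-dimensional coordinate subspace, so ‘most’ points are far from W.

```python
# Typical distance from a random point of [0,1]^D to the span of the first d
# coordinate axes (the distance is just the norm of the remaining D-d coordinates).
import numpy as np

rng = np.random.default_rng(0)
d = 10                                        # "model" dimension (illustrative)
for D in (20, 100, 1000):                     # ambient dimension
    y = rng.uniform(size=(10_000, D))
    dist = np.linalg.norm(y[:, d:], axis=1)   # distance to the d-dimensional subspace
    print(f"D={D:5d}: mean distance ~ {dist.mean():.2f} (cube diameter sqrt(D) ~ {np.sqrt(D):.1f})")
```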
Unless… W is a fractal.
This is from Marzen & Crutchfield’s “Nearly Maximally Predictive Features and Their Dimensions”. The setup is Crutchfield’s computational mechanics, whose central characters are Hidden Markov Models. I won’t go into the details here [but give it a read!].
The conjecture is the following: a ‘good architecture’ defines a model space W that is effectively a fractal in the much larger-dimensional space Δ^k of realistic data distributions, such that for any possible true distribution q(x) ∈ Δ^k the KL divergence satisfies min_{w ∈ W} K(w) = K(w_opt) ≤ ϵ for some small ϵ.
Grokking
Phase transitions in loss when varying model size are designated ‘grokking’. We can combine the fractal data manifold hypothesis with an SLT perspective: as we scale up the model, the set of optimal parameters W_opt becomes better and better. It could happen that the model size gets big enough that it includes a whole new phase, meaning a w_opt with radically lower loss K and higher λ.
EDIT: seems I’m confused about the nomenclature. Grokking doesn’t refer to phase transitions in model size, but in training time and data size.
EDIT2: Seems I’m not crazy. Thanks to Matt Farugia for pointing me towards this result: neural networks are strongly nonconvex (i.e. fractal).
EDIT3: It seems to me that there is another respect in which universal approximation theorems (Stone-Weierstrass) are misleading. Stone-Weierstrass applies to a subalgebra of the continuous functions, and it seems to me that in the natural parameterization ReLU neural networks aren’t a subalgebra of the continuous functions (see also the nonconvexity above).
To think about: information dimension
[1] Why the −1? Think visually about the set of distributions on 3 points. Hint: it’s a solid triangle (a ‘2-simplex’).
[2] I think MLers call this the ‘data manifold’?
[3] In the mathematical sense of a smooth manifold, not the ill-defined ML notion of ‘data manifold’.
Very interesting, glad to see this written up! Not sure I totally agree that it’s necessary for W to be a fractal? But I do think you’re onto something.
In particular you say that “there are points y in the larger dimensional space that are very (even arbitrarily) far from W,” but in the case of GPT-4 the input space is discrete, and even in the case of e.g. vision models the input space is compact. So the distance must be bounded.
Plus if you e.g. sample a random image, you’ll find there’s usually a finite distance you need to travel in the input space (in L1, L2, etc) until you get something that’s human interpretable (i.e. lies on the data manifold). So that would point against the data manifold being dense in the input space.
But there is something here, I think. The distance usually isn’t that large until you reach a human interpretable image, and it’s quite easy to perturb images slightly to have completely different interpretations (both to humans and ML systems). A fairly smooth data manifold wouldn’t do this. So my guess is that the data “manifold” is in fact not a manifold globally, but instead has many self-intersections and is singular. That would let it be close to large portions of input space without being literally dense in it. This also makes sense from an SLT perspective. And IIRC there’s some empirical evidence that the dimension of the data “manifold” is not globally constant.
The input and output spaces Ω etc. are all discrete, but the spaces of distributions Δ(Ω) on those spaces are infinite (though still finite-dimensional).
Whether or not you can be arbitrarily far depends on what kind of metric one uses, compactness assumptions, etc. I am being rather vague here. For instance, if you use the KL divergence, then K(q | p_uniform) is always bounded: indeed it equals log|Ω| − H(q), where H(q) is the entropy of the true distribution.
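A short numerical check of that identity (in nats), just to make it concrete:

```python
# For q on a finite set of size N: KL(q || uniform) = log N - H(q), hence bounded by log N.
import numpy as np

N = 8
q = np.random.default_rng(1).dirichlet(np.ones(N))   # a random distribution on N points
kl = float(np.sum(q * np.log(q * N)))                # KL(q || uniform), uniform = 1/N
H = float(-np.sum(q * np.log(q)))                    # entropy H(q)
print(kl, np.log(N) - H)                             # the two numbers agree
```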
I don’t really know what ML people mean by the data manifold so won’t say more about that.
I am talking about the space W of parameter values of a conditional probability distribution p(x|w).
I think W having nonconstant local dimension isn’t that relevant, since the largest-dimensional subspace would dominate?
Self-intersections and singularities could certainly occur here. (i) Singularities in the SLT sense have to do with singularities in the level sets of the KL divergence (or loss function); I don’t immediately see how these are related to the singularities that you are talking about here. (ii) It wouldn’t increase the dimensionality (rather the opposite).
The fractal dimension is important basically because of space-filling curves: a space that has a low-dimensional parameterization can nevertheless have a very large effective dimension when embedded fractally into a larger-dimensional space.
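Here is a minimal sketch of that mechanism (a Morton/Z-order style toy of my own, not the Marzen & Crutchfield construction): a single real parameter whose image comes within about 2^(−k) of every point of the unit square after k rounds of digit de-interleaving.

```python
# A one-parameter family that is "effectively 2-dimensional": map t in [0,1) to the
# square by de-interleaving its binary digits. With 2k digits the image is a
# 2^k x 2^k grid, so every point of [0,1]^2 lies within roughly 2^(-k) of the image.
def deinterleave(t, k=10):
    bits = [int(t * 2 ** (i + 1)) % 2 for i in range(2 * k)]      # first 2k binary digits of t
    x = sum(b / 2 ** (i + 1) for i, b in enumerate(bits[0::2]))   # even-position digits -> x
    y = sum(b / 2 ** (i + 1) for i, b in enumerate(bits[1::2]))   # odd-position digits  -> y
    return x, y

print(deinterleave(0.0))         # (0.0, 0.0)
print(deinterleave(0.6180339))   # some interior point of the square
```

(The limiting map here is not continuous, unlike a Hilbert curve, but it makes the point: a low-dimensional parameter set can be ϵ-dense in a higher-dimensional space.)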
Sorry, I realized that you’re mostly talking about the space of true distributions and I was mainly talking about the “data manifold” (related to the structure of the map x↦p(x∣w∗) for fixed w∗). You can disregard most of that.
Though, even in the case where we’re talking about the space of true distributions, I’m still not convinced that the image of W under p(x∣w) needs to be fractal. Like, a space-filling assumption sounds to me like basically a universal approximation argument—you’re assuming that the image of W densely (or almost densely) fills the space of all probability distributions of a given dimension. But of course we know that universal approximation is problematic and can’t explain what neural nets are actually doing for realistic data.
Obviously this is all speculation, but maybe I’m saying that the universal approximation theorem implies that neural architectures are fractal in the space of all distributions (or some restricted subset thereof)?
Curious what’s your beef with universal approximation? Stone-Weierstrass isn’t quantitative; is that the reason?
If true, it suggests the fractal dimension (probably related to the information dimension I linked to above) may be important.
Oh, I actually don’t think this is speculation: if (big if) you satisfy the conditions for universal approximation, then this is just true (specifically, the image of W is dense in function space). For example, you can state Stone-Weierstrass as: for a compact subset X of R^n and the continuous functions C(X,R) under the sup norm, the subalgebra of polynomials is dense in C(X,R). In practice you’d only have a finite-dimensional subset of the polynomials, so this obviously can’t hold exactly, but as you increase the size of the polynomial subspace it becomes more space-filling and the error bound decreases.
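For what it’s worth, the one-dimensional version of that picture is easy to check numerically (the target function below is an arbitrary smooth choice of mine):

```python
# Least-squares polynomial approximation of a smooth function on [0,1]: the sup-norm
# error falls as the degree (the size of the polynomial subspace) grows.
import numpy as np

x = np.linspace(0.0, 1.0, 2001)
f = np.exp(np.sin(2 * np.pi * x))                    # arbitrary smooth target

for deg in (2, 5, 10, 20):
    fit = np.polynomial.Chebyshev.fit(x, f, deg)     # degree-`deg` least-squares fit
    print(f"degree {deg:2d}: sup-norm error ~ {np.max(np.abs(fit(x) - f)):.1e}")
```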
The problem is that the dimension of W (i.e. the number of parameters) required to achieve a given ϵ error bound grows exponentially with the dimension d of your underlying space X. For instance, if you assume that the weights depend continuously on the target function, ϵ-approximating all C^n functions on [0,1]^d with Sobolev norm ≤ 1 provably takes at least on the order of ϵ^(−d/n) parameters (DeVore et al.). This is a lower bound.
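Plugging illustrative numbers into that bound (my own choices of ϵ and n, just to show the scaling):

```python
# The eps^(-d/n) lower bound with eps = 0.1 and smoothness n = 2: the required
# parameter count is 10^(d/2), astronomical for any realistic input dimension d.
from math import log10

eps, n = 0.1, 2
for d in (2, 10, 100, 1000):
    exponent = (d / n) * -log10(eps)     # eps^(-d/n) = 10^exponent
    print(f"d={d:5d}: at least ~10^{exponent:.0f} parameters")
```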
So for any realistic d universal approximation is basically useless—the number of parameters required is enormous. Which makes sense because approximation by basis functions is basically the continuous version of a lookup table.
Because neural networks actually work in practice, without requiring exponentially many parameters, this also tells you that the space of realistic target functions can’t just be some generic function space (even with smoothness conditions), it has to have some non-generic properties to escape the lower bound.
Ooooo okay so this seems like it’s directly pointing to the fractal story! Exciting!
Obviously this is all speculation, but maybe I’m saying that the universal approximation theorem implies that neural architectures are fractal in the space of all distributions (or some restricted subset thereof)?
Stone-Weierstrass isn’t quantitative. If true, it suggests the fractal dimension (probably related to the information dimension I linked to above) may be important.