You either think of the NN weights as a countable set (by e.g. truncating precision “as in real life”), in which case you get something like $\sum_{t=1}^{\infty}\mathbb{E}[s_t]\le\log\big(w_\mu^{-1}\big)$, but this is sort of weak sauce: you get this for any prior you want to put over your discrete set of NN weights, with no implied connection to K-complexity unless you put one in by hand by taking $w_\mu=2^{-K(\mu)}$.
No, you don’t need to put in $w_\mu=2^{-K(\mu)}$ by hand. A uniform prior over NN weights does the job.[1]
The trick is that a transformer run in recurrent mode can
1. simulate a (time and space bounded) UTM in a few transformer blocks, and
2. use the other transformer blocks to store program code to feed that UTM as input.
A uniform prior over neural network parameters then effectively implies a uniform prior over programs to run on the simulated UTM, modulo the bit specification cost of the UTM simulator and the storage setup: every bit of program code we don’t need to store frees up degrees of freedom in the weights. Since induction with a uniform prior on the input strings to a plain monotone UTM effectively gets us a weighting of hypotheses that’s exponentially small in their K-complexity, we’ll get an error bound with a term proportional to $K(\mu)$, plus an offset term for specifying the UTM and storage in the transformer weights.
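As a rough counting sketch of why that should give the bound below (not the actual proof; here $c$ just stands for however many parameters my construction spends per stored program bit, and $\xi$ is the Bayes mixture under the uniform prior over weight configurations): any weight configuration that pins down the UTM simulator plus a program for $\mu$ fixes roughly $K(M,F)+c\,K(\mu,M)$ parameters and leaves the rest free, so
$$w_\mu \;\gtrsim\; 2^{-f\left(K(M,F)+c\,K(\mu,M)\right)}, \qquad D(\mu\,\|\,\xi)\;\le\;\log w_\mu^{-1}\;\le\; f\log 2\,\big(K(M,F)+c\,K(\mu,M)\big),$$
which is the shape of the bound stated next, with $c$ being the prefactor discussed there.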
For the sake of concreteness: if I partially adapted your notation, and went to the special case where the data-generating process $\mu$ is exactly realisable in the weights of the transformer[2], I’d currently seem to get a bound $D(\mu\,\|\,\xi)\le f\log 2\;K(M,F)+f\log 2\left(\tfrac{9}{2}+\tfrac{5}{2}d\right)K(\mu,M)+O(1)$.[3]
Here, $f$ is the number of bits per neural network parameter[4], $K(M,F)$ is the number of parameters needed to implement the UTM $M$ on the recurrent transformer architecture $F$, $K(\mu,M)$ is the K-complexity of the data-generating program $\mu$ on the UTM $M$ in bits, and $d$ is the width of the residual stream.
The prefactor $\left(\tfrac{9}{2}+\tfrac{5}{2}d\right)$ is there because my current construction is stupidly inefficient at storing program code in the weights. I think it ought to be possible to do better, and get this down to a 1. Don’t quote me on that though, I don’t have a proof yet.
If we don’t assume realisability, we can instead take any ‘efficient predictor’ program $p^*$ that is realisable on the transformer, and get the bound
$$D(\mu\,\|\,\xi)\le D(\mu\,\|\,p^*)+f\log 2\;K(M,F)+f\log 2\left(\tfrac{9}{2}+\tfrac{5}{2}d\right)K(p^*,M)+O(1).$$
K-complexity usually enters the theory via a choice of prior, and in continuous model classes priors show up in the constant order terms of asymptotic expansions in n.
The result here is exactly that we don’t need to put in the K-complexity[5] via choice of prior. If we’re using a recurrent neural network, the K-complexity is in the prior already, just as it is on a plain monotone UTM. The architecture itself is what implements the bias toward simplicity.
Note also that in the case of continuous parameters, i.e. bits per float $f\to\infty$, the K-complexity terms in the bound do not become constant-order terms, because they have $f$ as a prefactor. This is one way to start seeing that the K-complexity and the learning coefficient are pretty directly related quantities in the setting of recurrent neural networks.
[1] I expect a Gaussian prior or anything else of the sort probably works as well, and yields a nigh-identical bound. But I haven’t shown that yet.[6]
[2] My actual bound doesn’t need that assumption. Getting rid of the realisability assumption is what the ‘efficient predictor’ stuff is all about.
[3] The $O(1)$ terms become increasingly irrelevant as float precision $f$ gets larger. Basically, I’m using large negative biases to zero out storage neurons that are not needed. In the continuum limit, this would make the weights connecting to those neurons degenerate, and we could integrate them out of the measure. But since we’re in the discrete setting, we have to keep track of the fact that very large weight magnitudes, which would overwhelm the negative biases and switch the neuron on again, aren’t allowed. This makes our volume of allowed parameter configurations just a little bit smaller.
[4] So, $f=8$ for 8-bit floats, $f=16$ for 16-bit floats, etc.
[5] Defined relative to a time and space bounded universal Turing machine.
[6] EDIT: As in, I haven’t shown it in the case of finite float precision NN parameters yet. It of course straightforwardly follows in the SLT setting, where NN parameters are real numbers and we consider the limit of the number of datapoints going to infinity. The shape of the prior can’t matter much there, as you say.
Where is $n$ here?
Doesn’t exist.[1] If $f$ is finite, you can insert AIT-style inequalities into the posterior to get bounds like the one I wrote above. This is neat if you e.g. have more than $2^f$ datapoints.
If $f$ is infinite, you probably want to expand in $n$ instead. I haven’t done that yet, but I expect to get a bound that looks a lot like the standard free energy formula, with the K-complexity terms in the bound I wrote above showing up where the learning coefficient would usually be. The $f$ prefactor probably gets swapped out for a $\tfrac{\log n}{2}$.
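Written out, the shape I’d guess at (and this is emphatically a guess, not something I’ve shown; $F_n$ is the Bayesian free energy and $S_n$ the empirical entropy term from the standard expansion $F_n\approx nS_n+\lambda\log n$) would be something like
$$F_n \;\lesssim\; nS_n \;+\; \frac{\log n}{2}\Big(K(M,F)+\big(\tfrac{9}{2}+\tfrac{5}{2}d\big)\,K(\mu,M)\Big) \;+\; o(\log n),$$
i.e. the slot usually occupied by the learning coefficient gets filled by an upper bound built out of the K-complexity terms.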
It’d still be an upper bound, not an equality, just as in AIT. The learning coefficient can still be smaller than this. This makes sense to me: there might be less complicated ways for the transformer to make an efficient prediction than simulating a UTM and running some program on it.
[1] Except for the implicit dependence in $D(\mu\,\|\,\xi)$ and $D(\mu\,\|\,p^*)$, since those are the KL-divergences summed over $n$ datapoints.
OK, makes sense to me, thanks for explaining. Based on my understanding of what you are doing, the statement in the OP that λ in your setting is “sort of” K-complexity is a bit misleading? It seems like you will end up with bounds on D(μ||ξ) that involve the actual learning coefficient, which you then bound above by noting that unused bits in the code give rise to degeneracy. So there is something like λ≤K going on ultimately.
If I understand correctly you are probably doing something like:
1. Identified a continuous space $W$ (the parameters of your NN run in recurrent mode)
2. Embedded a set of Turing machine codes $W_{\mathrm{code}}$ into $W$ (by encoding the execution of a UTM into the weights of your transformer)
3. Used $p(y\mid x,w)$ parametrised by the transformer, where $w\in W$, to provide what I would call a “smooth relaxation” of the execution of the UTM for some number of steps
4. Used this as the model in the usual SLT setting
5. Noted that, because of the way you encoded the UTM and its step function, if you vary $w$ away from the configuration corresponding to a TM code $[M]$ in a bit of the description that corresponds to unused states or symbols, it can’t affect the execution, and so there is degeneracy in the KL divergence $K$
6. Hence $\lambda([M])\le\operatorname{len}([M])$, and then, repeating this over all TMs $M$ which perfectly fit the given data distribution, we get a bound on the global $\lambda\le K$ (spelled out just below).
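Spelling that last step out, the chain I have in mind is roughly (taking the global $\lambda$ to be the minimal local learning coefficient over the set $W_0$ of true parameters, which is how I’m reading your setup):
$$\lambda \;=\; \min_{w_0\in W_0}\lambda(w_0)\;\le\;\min_{[M]}\lambda([M])\;\le\;\min_{[M]}\operatorname{len}([M])\;\approx\; K,$$
with the minima over $[M]$ running over codes of TMs that perfectly fit the data distribution.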
Proving Theorem 4.1 was the purpose of Clift-Wallbridge-Murfet, just with a different smooth relaxation. The particular smooth relaxation we prefer for theoretical purposes is one coming from encoding a UTM in linear logic, but the overall story works just as well if you are encoding the step function of a TM in a neural network and I think the same proof might apply in your case.
Anyway, I believe you are doing several things differently: you are treating the iid case, you are introducing D(μ||ξ) and the bound on that (which is not something I have considered), and obviously the transformer running in recurrent mode as a smooth relaxation of the UTM execution is different to the one we consider.
From your message it seems like you think the global learning coefficient might be lower than K, but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than “the number of bits used” when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
Based on my understanding of what you are doing, the statement in the OP that λ in your setting is “sort of” K-complexity is a bit misleading?
Yes, I guess it is. In my (weak) defence, I did put a ‘(sort of)’ in front of that.
In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis h expressed on two different UTMs.
If we have a UTM $M_1$ and a different UTM $M_2$, we know that $K(h,M_2)\le K(h,M_1)+K(M_1,M_2)$, because if nothing else we can simulate UTM $M_1$ on UTM $M_2$ and compute $h$ on the simulated $M_1$. But in real life, we’d usually expect the actual shortest program that implements $h$ on $M_2$ not to involve jumping through hoops like this.
In the case of translating between a UTM and a different sort of Turing-complete model of computation, namely a recurrent neural network[1], I was expecting a similar sort of dynamic: if nothing else, we can always implement $h$ on the NN by simulating a UTM and running $h$ on that simulated UTM. So the lowest-LLC parameter configuration that implements $h$ on the NN has to have an LLC that is as small as or smaller than the LLC of a parameter configuration that implements $h$ through this simulation route. Or that was the intuition I had starting out, anyway.
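In symbols, and in parallel to the inequality above (this is just the intuition; $w_{\mathrm{sim}}(h)$ is my shorthand here for a parameter configuration implementing $h$ via the simulation route, and $\lambda(w)$ is the local learning coefficient at $w$):
$$\min_{w\,:\,w\text{ implements }h}\lambda(w)\;\le\;\lambda\big(w_{\mathrm{sim}}(h)\big),$$
with the right-hand side then hopefully bounded by a UTM-simulation overhead plus a term for the stored program, mirroring $K(h,M_2)\le K(h,M_1)+K(M_1,M_2)$.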
If I understand correctly you are probably doing something like:
Seems broadly right to me except:
Third bullet point: I don’t know what you mean by a “smooth relaxation” precisely. So while this sounds broadly correct to me as a description of what I do, I can’t say for sure.
Sixth bullet point: You forgot the offset term for simulating the UTM on the transformer. Also, I think I’d get a constant prefactor before $\operatorname{len}([M])$. Even if I’m right that the prefactor I have right now could be improved, I’d still expect at least a $\tfrac{1}{2}$ here.
I’d caution that the exact relation to the learning coefficient and the LLC is the part of this story I’m still the least confident about at the moment. As the intro said:
This post is my current early-stage sketch of the proof idea. Don’t take it too seriously yet. I’m writing this out mostly to organise my own thoughts.
I’ve since gotten proof sketches for most of the parts here, including the upper bound on the LLC, so I am a bit more confident now. But they’re still hasty scrawlings.
you are treating the iid case
I am not sure whether I am? I’m a bit unclear on what exactly you mean by iid in this context. The setup does not seem to me to require different inputs $x$ to be independent of each other. It does assume that each label $y$ is a function of its corresponding input $x$ rather than of some other input. So, label $y_1$ can depend on input $x_1$, but it can only depend on $x_2$ in a manner mediated by $x_1$. In other words, the joint probability distribution over inputs can be anything, but the labels must be iid conditioned on their inputs. I think. Is that what you meant?
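Writing out what I think the assumption amounts to (my attempt at formalising the above, so take it with a grain of salt; $q$ is an arbitrary joint distribution over the inputs):
$$x_{1:n}\sim q(x_{1:n}),\qquad p\big(y_{1:n}\mid x_{1:n}\big)\;=\;\prod_{i=1}^{n}\mu\big(y_i\mid x_i\big),$$
i.e. the labels are conditionally independent given the inputs, each drawn from the same conditional $\mu(y\mid x)$.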
From your message it seems like you think the global learning coefficient might be lower than K, but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than “the number of bits used” when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
I have been brooding over schemes to lower the bound I sketched above using activation error-correction blocks. It is still unclear to me at this stage whether this will work or not. I’d say this, and the workability of other schemes to get rid of the $f$ prefactor on $K(\mu,M)$ in the bound, are probably the biggest sources of uncertainty about this at the moment.
If schemes like this work, the story here probably ends up as something more like ‘$K(\mu,M)$ is related to the number of bits in the parameters we need to fix to implement $\mu$ on the transformer.’
In that case, you’d be right, and the LLC would be lower, because in the continuum limit we can store an arbitrary number of bits in a single parameter.
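Just to illustrate the continuum-limit point (an illustration, not my actual construction): a single real-valued parameter can in principle encode arbitrarily many program bits, e.g. via its binary expansion,
$$w \;=\; \sum_{i=1}^{k} b_i\,2^{-i}\quad\text{for bits } b_1,\dots,b_k,$$
so far fewer than one parameter per program bit would need to be pinned down.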
I think I went into this kind of expecting that to be true. Then I got surprised when using less than one effective parameter per bit of storage in the construction turned out to be less straightforward than I’d thought once I actually engaged with the details. Now, I don’t know what I’ll end up finding.
[1] Well, transformers are not actually Turing complete in real life, where parameters aren’t real numbers, because if you want an unbounded context window to simulate an unbounded tape, you eventually run out of space for positional encodings. But the number of bits they can hold in memory does grow exponentially with the residual stream width, which seems good enough to me. Real computers don’t have infinite memory either.