Ok makes sense to me, thanks for explaining. Based on my understanding of what you are doing, the statement in the OP that λ in your setting is “sort of” K-complexity is a bit misleading? It seems like you will end up with bounds on D(μ||ξ) that involve the actual learning coefficient, which you then bound above by noting that un-used bits in the code give rise to degeneracy. So there is something like λ≤K going on ultimately.
If I understand correctly you are probably doing something like:
- Identified a continuous space W (parameters of your NN run in recurrent mode)
- Embedded a set of Turing machine codes Wcode into W (by encoding the execution of a UTM into the weights of your transformer)
- Used p(y|x,w) parametrised by the transformer, where w∈W, to provide what I would call a “smooth relaxation” of the execution of the UTM for some number of steps
- Used this as the model in the usual SLT setting
- Noted that because of the way you encoded the UTM and its step function, if you vary w away from the configuration corresponding to a TM code [M] in a bit of the description that corresponds to unused states or symbols, it can’t affect the execution, and so there is degeneracy in the KL divergence K
- Hence, λ([M])≤len([M]), and repeating this over all TMs M which perfectly fit the given data distribution, we get a bound λ≤K on the global learning coefficient (spelled out a bit more in the sketch below)
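Spelled out a bit (in my notation, and assuming the used part of the description occupies at most len([M]) coordinates of w), the degeneracy step would look something like:

```latex
% Degeneracy sketch: split w = (w_used, w_unused) around the code [M].
% The relaxed execution never reads w_unused, so K is constant in those directions:
\[
  K(w_{\mathrm{used}}, w_{\mathrm{unused}}) = K(w_{\mathrm{used}}, w'_{\mathrm{unused}})
  \quad \text{for all } w_{\mathrm{unused}},\, w'_{\mathrm{unused}}.
\]
% Each remaining (non-degenerate) direction contributes at most 1/2 to the local
% learning coefficient, so, if the used part of the description occupies at most
% len([M]) coordinates of w,
\[
  \lambda([M]) \le \tfrac{1}{2}\, \#\{\text{used coordinates}\} \le \operatorname{len}([M]).
\]
% Minimising over all codes that fit the data exactly then bounds the global coefficient:
\[
  \lambda = \min_{w_0 \in W_0} \lambda(w_0) \le \min_{M \,:\, p(\cdot \mid \cdot, [M]) = q} \operatorname{len}([M]).
\]
```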
Proving Theorem 4.1 was the purpose of Clift-Wallbridge-Murfet, just with a different smooth relaxation. The particular smooth relaxation we prefer for theoretical purposes is one coming from encoding a UTM in linear logic, but the overall story works just as well if you are encoding the step function of a TM in a neural network and I think the same proof might apply in your case.
Anyway, I believe you are doing at least several things differently: you are treating the iid case, you are introducing D(μ||ξ) and the bound on that (which is not something I have considered) and obviously the Transformer running in recurrent mode as a smooth relaxation of the UTM execution is different to the one we consider.
From your message it seems like you think the global learning coefficient might be lower than K, but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than “the number of bits used” when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
Based on my understanding of what you are doing, the statement in the OP that λ in your setting is “sort of” K-complexity is a bit misleading?
Yes, I guess it is. In my (weak) defence, I did put a ‘(sort of)’ in front of that.
In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis h expressed on two different UTMs.
If we have a UTM M1 and a different UTM M2, we know that K(h,M2)≤K(h,M1)+K(M1,M2), because if nothing else we can simulate UTM M1 on UTM M2 and compute h on the simulated M1. But in real life, we’d usually expect the actual shortest program that implements h on M2 to not involve jumping through hoops like this.
In the case of translating between a UTM and a different sort of Turing-complete model of computation, namely a recurrent neural network[1], I was expecting a similar sort of dynamic: if nothing else, we can always implement h on the NN by simulating a UTM, and running h on that simulated UTM. So the lowest-LLC parameter configuration that implements h on the NN has to have an LLC that is as small as or smaller than the LLC of a parameter configuration that implements h through this simulation route. Or that was the intuition I had starting out, anyway.
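In symbols, the chain I was expecting looks something like the following, where w_sim([h]) is a hypothetical parameter configuration that hard-codes a UTM simulator together with a program [h] for h, and c_sim, α are placeholder constants for the simulator offset and the per-bit cost:

```latex
% Best implementation of h on the NN is at most as complex as the "simulate a UTM" route:
\[
  \min_{w \,:\, p(\cdot \mid \cdot, w) = h} \lambda(w)
  \;\le\; \lambda\big(w_{\mathrm{sim}}([h])\big)
  \;\le\; c_{\mathrm{sim}} + \alpha \cdot \operatorname{len}([h]),
\]
% mirroring K(h, M2) <= K(h, M1) + K(M1, M2) in the UTM-to-UTM case above.
```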
If I understand correctly you are probably doing something like:
Seems broadly right to me except:
Third bullet point: I don’t know what you mean by a “smooth relaxation” precisely. So while this sounds broadly correct to me as a description of what I do, I can’t say for sure.
Sixth bullet point: You forgot the offset term for simulating the UTM on the transformer. Also, I think I’d get a constant prefactor before len([M]). Even if I’m right that the prefactor I have right now could be improved, I’d still expect at least a 1/2 here (a rough sketch of why is below).
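(Rough sketch of where the 1/2 floor would come from, assuming each bit of [M] ends up pinning at least one coordinate of w, and that pinned coordinates are regular, i.e. K grows quadratically as they move off the code:)

```latex
% Quadratically pinned coordinates contribute 1/2 each to the LLC; flat ones contribute 0:
\[
  K(w) \approx \sum_{i \in \mathrm{pinned}} c_i (w_i - w_i^{*})^2
  \quad \Longrightarrow \quad
  \lambda([M]) \approx \tfrac{1}{2}\, \#\{\mathrm{pinned}\} \;\ge\; \tfrac{1}{2} \operatorname{len}([M]).
\]
```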
I’d caution that the exact relation to the learning coefficient and the LLC is the part of this story I’m still the least confident about at the moment. As the intro said:
This post is my current early-stage sketch of the proof idea. Don’t take it too seriously yet. I’m writing this out mostly to organise my own thoughts.
I’ve since gotten proof sketches for most of the parts here, including the upper bound on the LLC, so I am a bit more confident now. But they’re still hasty scrawlings.
you are treating the iid case
I am not sure whether I am? I’m a bit unclear on what you mean by iid in this context exactly. The setup does not seem to me to require different inputs x to be independent of each other. It does assume that each label y is a function of its corresponding input x rather than some other input. So, label y1 can depend on input x1, but it can only depend on x2 in a manner mediated by x1. In other words, the joint probability distribution over inputs can be anything, but the labels must be iid conditioned on their inputs. I think. Is that what you meant?
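Concretely, the factorisation I have in mind is (my notation):

```latex
% Arbitrary joint distribution over the inputs; labels conditionally independent given
% their own inputs:
\[
  q(x_1, \dots, x_n, y_1, \dots, y_n) = q(x_1, \dots, x_n) \prod_{i=1}^{n} q(y_i \mid x_i).
\]
```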
From your message it seems like you think the global learning coefficient might be lower than K, but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than “the number of bits used” when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
I have been brooding over schemes to lower the bound I sketched above using activation error-correction blocks. Still unclear to me at this stage whether this will work or not. I’d say this and the workability of other schemes to get rid of the f prefactor to K(μ,M) in the bound are probably the biggest sources of uncertainty about this at the moment.
If schemes like this work, the story here probably ends up as something more like ‘K(μ,M) is related to the number of bits in the parameters we need to fix to implement μ on the transformer.’
In that case, you’d be right, and the LLC would be lower, because in the continuum limit we can store an arbitrary number of bits in a single parameter.
I think I went into this kind of expecting that to be true. Then I got surprised when using less than one effective parameter per bit of storage in the construction turned out to be less straightforward than I’d thought once I actually engaged with the details. Now, I don’t know what I’ll end up finding.
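(As a toy illustration of the ‘arbitrarily many bits in one parameter’ point, and nothing to do with the actual construction: pack a bit string into the binary expansion of a single number and read it back. With exact reals this works for any number of bits; with float64 it caps out at roughly 52.)

```python
def pack_bits(bits):
    """Encode a bit string as the binary expansion of one number in [0, 1)."""
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))

def unpack_bits(w, n):
    """Read the first n bits back out of the single parameter w."""
    return [int(w * 2 ** (i + 1)) % 2 for i in range(n)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
w = pack_bits(bits)                        # one real-valued "parameter"
assert unpack_bits(w, len(bits)) == bits   # exact recovery at this precision
```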
[1] Well, transformers are not actually Turing complete in real life where parameters aren’t real numbers, because if you want an unbounded context window to simulate unbounded tape, you eventually run out of space for positional encodings. But the amount of bits they can hold in memory does grow exponentially with the residual stream width, which seems good enough to me. Real computers don’t have infinite memory either.