Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity.
I’m very confused about this myself, having just read this introductory paper on the subject.
My understanding is that an “ideal” reference UTM would be a universal Turing machine with no bias towards any arbitrary string, but rigorously defining such a machine is an open problem. Based on our observation of UTMs, the more arbitrary simplifications a Turing machine makes, the longer its compiler has to be on other UTMs; this is called the Short Compiler Assumption. Since we can’t agree on a single ideal reference UTM, we instead approximate it by limiting ourselves to a class of “natural” UTMs which are mutually interpretable within a constant. The smaller the constant, the less arbitrary simplification the UTMs in the class will tend to make.
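For what it’s worth, the “mutually interpretable within a constant” part seems to be just the invariance theorem. A loose sketch, in my own notation rather than the paper’s:

```latex
% K_U(x): length of the shortest program that outputs x on machine U.
% c_{UV}: length of a compiler that lets U run V-programs.
\[
  K_U(x) \;\le\; K_V(x) + c_{UV} \qquad \text{for every string } x,
\]
% where c_{UV} depends only on the pair of machines, not on x.  A
% "natural" class in the sense above would be one in which all of these
% cross-compiler constants are small.
```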
This seems to mesh with the Sequences post on Occam’s Razor:
What if you don’t like Turing machines? Then there’s only a constant complexity penalty to design your own Universal Turing Machine that interprets whatever code you give it in whatever programming language you like.
What I’m confused about is this constant penalty. Is it just “some constant” or is it knowable and small? Is there a specific UTM for which we can always write a short compiler on any other UTM?
I’m getting out of my league here, but I’d guess that there is an upper bound on how complex you can make certain instructions across all UTMs, because a UTM must (a) be finite and (b) at the lowest level implement a minimal set of instructions, including a functionally complete set of logical connectives. So, for example, say I take as my “nonbiased” UTM one that, aside from the elementary operations of the machine on its tape, jump instructions, etc., has only a minimal number of instructions implementing a complete operator set built from a single connective: {NAND} or {NOR}. My understanding is that anything that’s a universal Turing machine is going to have to implement the basic machine instructions and a complete set of connectives somewhere in its own instruction set, and converting between {NAND} or {NOR} and any other complete set of connectives can be done with a trivially short encoding.
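To make “trivially short encoding” concrete, here is a minimal sketch in Python rather than in actual machine instructions (purely illustrative, not part of any real UTM): each of the usual connectives is a one-line composition of NAND.

```python
# Every standard connective as a short composition of NAND -- the reason
# translating between {NAND} and any other functionally complete set of
# connectives costs only a few symbols.
def nand(a, b):
    return not (a and b)

def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))

def xor_(a, b):
    return and_(or_(a, b), nand(a, b))

# Check every truth-table row.
for a in (False, True):
    for b in (False, True):
        assert and_(a, b) == (a and b)
        assert or_(a, b) == (a or b)
        assert xor_(a, b) == (a != b)
```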
Since we can’t agree on a single ideal reference UTM, we instead approximate it by limiting ourselves to a class of “natural” UTMs which are mutually interpretable within a constant.
Even if you do that, you’re left with an infinite number of cliques such that within any given clique the languages can write short interpreters for each other. Picking one of the cliques is just as arbitrary as picking a single language to begin with.
i.e. for any given class X of what-we-intuitively-think-of-as-complicated programs, you can design a language that concisely represents members of X and also concisely represents interpreters for other languages that share this special case.
What I’m confused about is this constant penalty. Is it just “some constant” or is it knowable and small?
It’s a function of the language you’re writing an interpreter of and the language you’re writing it in. “Constant” in that it doesn’t depend on the programs you’re going to run in the new language.
i.e. for any given pair of languages there’s a finite amount of disagreement between those two versions of the Solomonoff prior; but for any given number there are pairs of languages that disagree by more than that.
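Spelled out loosely, using the simplified 2^-K form of the prior and writing c_UV, c_VU for the two compiler lengths (my notation, just for illustration):

```latex
% If U and V can compile for each other with constants c_{UV} and c_{VU},
% the invariance theorem gives, for every string x,
\[
  K_U(x) \;\le\; K_V(x) + c_{UV}, \qquad K_V(x) \;\le\; K_U(x) + c_{VU},
\]
% so the two priors m_U(x) = 2^{-K_U(x)} and m_V(x) = 2^{-K_V(x)} sit
% within fixed multiplicative factors of each other:
\[
  2^{-c_{VU}}\, m_U(x) \;\le\; m_V(x) \;\le\; 2^{\,c_{UV}}\, m_U(x).
\]
% Finite disagreement for any fixed pair of languages -- but nothing
% bounds the constants over all pairs.
```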
Is there a specific UTM for which we can always write a short compiler on any other UTM?
No. There’s no law against having a gigabyte-long opcode for NAND, and using all the shorter opcodes for things-we-intuitively-think-of-as-complicated.
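A toy sketch of that freedom (the instruction names and numbers below are made up; the only point is that prior mass tracks opcode length, and nothing in the definition of a UTM says which operations get the short opcodes):

```python
# In a prefix-free instruction encoding, a one-instruction program gets
# prior mass 2 ** (-opcode length), so the machine's designer decides
# which operations are "cheap".  Nothing forces NAND to get a short opcode.
innocent_opcode_bits = {"NAND": 1, "JUMP": 2, "HALT": 2}
perverse_opcode_bits = {
    "NAND": 8_000_000_000,                 # a gigabyte-long opcode
    "print_my_pet_complicated_theory": 1,  # one bit
}

def prior_mass(opcode_bits, instruction):
    """Prior mass a 2^-length code assigns to a single-instruction program."""
    return 2.0 ** (-opcode_bits[instruction])

print(prior_mass(innocent_opcode_bits, "NAND"))                             # 0.5
print(prior_mass(perverse_opcode_bits, "print_my_pet_complicated_theory"))  # 0.5
print(prior_mass(perverse_opcode_bits, "NAND"))                             # underflows to 0.0
```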
So is there then a pragmatic assumption that can be made? Maybe we assume that if I pick a Turing machine blindly, without specifically designing it for a particular output string, it’s unlikely to be strongly biased towards that string.
What probability distribution over Turing machines do you blindly pick it from? That’s another instance of the same problem.
Pragmatically, if I non-blindly pick some representation of Turing machines that looks simple to me (e.g. the one Turing used), I don’t really doubt that it’s within a few thousand bits of the “right” version of Solomonoff, whatever that means.
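Roughly why a few-thousand-bit gap seems tolerable (a standard dominance argument, sketched loosely in notation I’m making up here): if my machine’s prior is within c bits of the “right” one, then

```latex
\[
  M_{\text{mine}}(x) \;\ge\; 2^{-c}\, M_{\text{right}}(x)
  \quad\Longrightarrow\quad
  -\log_2 M_{\text{mine}}(x) \;\le\; -\log_2 M_{\text{right}}(x) + c,
\]
% i.e. predicting with the "wrong" machine costs at most c extra bits of
% cumulative surprise over the whole data stream, however long it runs.
```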
Pragmatically, if I non-blindly pick some representation of Turing machines that looks simple to me (e.g. the one Turing used), I don’t really doubt that it’s within a few thousand bits of the “right” version of Solomonoff, whatever that means.
Why not?