You don’t really use Solomonoff induction with Bayes. It directly gives predictions (and in AIXI, evaluates actions). You could use 2^-len prior on hypotheses in Bayesian inference, but this is not Solomonoff induction, it just looks superficially similar but doesn’t share the properties.
Point entirely retracted. I am completely wrong here—I did not see how it works out to identical if you assign certainty of 1 to data. (didn’t occur to me to assign certainty of 1 to anything). Thanks for corrections.
I haven’t read Solomonoff’s original papers, but e.g. Shane Legg’s paper on Solomonoff induction gives a description that amounts to “use Bayes with a 2^-program_length prior”. How do you see the two differing and (if this isn’t obvious) why does it matter?
Well then you have to be using it on each and every program for each bit of uncertain data. (What is uncertain data? The regular definition can model the errors you might have in measurement apparatus as well; it is not clear what is the advantage of putting the uncertainty outside of the codes themselves)
If I am to assume that the hypotheses do not get down to their minimum length (no exhaustive uncomputable search), then there is a very significant penalty from 2^-length prior for actual theories versus just storing the data. The computable generative methods (such as a human making hypothesis) don’t find shortest codes, they may be generating codes that are e.g. the size of the shortest code squared, or even faster growing function.
edit: Also, btw, on the sum over all programs: consider the program that never runs past N bits so the bits after N do not matter. There is 1 program of length N, 2 programs of length N+1 , 4 programs of length N+2, and so on, that are identical to this program when run. Got to be careful counting and not forget you care about prefixes.
Perhaps I’m being dim, but it doesn’t look as if you answered my question: what differences do you see between “Solomonoff induction” and “Bayesian inference with a 2^-length prior”?
It sounds (but I may be misunderstanding) as if you’re contrasting “choose the shortest program consistent with the data” with “do Bayesian inference with all programs as hypothesis space and the 2^-length prior” in which case you’re presumably proposing that “Solomonoff induction” should mean the former. But—I have now found at least one of Solomonoff’s articles—it’s the latter, not the former, that Solomonoff proposed. (I quote from “A preliminary report on a general theory of inductive inference”, revised version. From the preface: “The main point of the report is Equation (5), Section 11”. From section 12, entitled “An interpretation of equation (5)”: “Equation (5) then enables us to use this model to obtain a priori probabilities to be used in computation of a posteriori probabilities using Bayes’ Theorem.”)
Solomonoff induction is uncomputable and any computable method necessarily fails to find shortest-possible programs; this is well known. If this fact is the basis for your objection to using the term “Solomonoff induction” to denote Bayes-with-length-prior then I’m afraid I don’t understand why; could you explain more?
It’s conventional to assume a prefix-free representation of programs when discussing this sort of formalism. I don’t see that this has anything to do with the distinction (if there is one) between Solomonoff and Bayes-with-length-prior.
Firstly, on what is conventional, it is sadly the case that in the LW conversations observed by me it is not typical that either party recollects the magnitude of the importance of having ALL programs as the hypothesis space, or any other detail, even the detail so important as ‘the output’s beginning is matched against observed data’.
I’m sorry I was thinking in the context whereby the hypothesis space is not complete (note: in the different papers on subjects if i recall correctly the hypothesis can either refer to the final string—the prediction—or to the code). Indeed in the case where it is applied to the complete set of hypotheses, it works fine. It still should be noted though that it is Solomonoff probability (algorithmic probability), that can be used with Bayes, resulting in Solomonoff induction.
It was my mistake with regards to what Solomonoff exactly specified, given that there is a multitude of different equivalent up to constant definitions that end up called same name. By Solomonoff induction I was referring to the one described in the article.
Thanks for correction in any case. This was genuinely an error on my part. Amusingly it was at +1 while noting that MWI (which someone else brought up) doesn’t begin with the observed data, was much more negatively treated.
You discussed this over here too, with greater hostility:
also someone somehow thinks that Solomonoff induction finds probabilities for theories, while it just assigns 2^-length as probability for software code of such length, which is obviously absurd when applied to anything but brute force generated shortest pieces of code,
I’m trying to figure out whether, when you say “someone”, you mean someone upthread or one of the original authors of the post. Because if it’s the post authors, then I get to accuse you of not caring enough about refraining from heaping abuse on writers who don’t deserve it to actually bother to check whether they made the mistake you mocked them for. The post includes discussion of ideals vs. approximations in the recipe metaphor and in the paragraph starting “The actual process above might seem a little underwhelming...”:
We just check every single hypothesis? Really? Isn’t that a little mindless and inefficient? This will certainly not be how the first true AI operates. But don’t forget that before this, nobody had any idea how to do ideal induction, even in principle. Developing fundamental theories, like quantum mechanics, might seem abstract and wasteful. [...] such theories and models change the world, as quantum mechanics did with modern electronics. In the future, [...] ways to approximate Solomonoff induction [...] Perhaps they will develop methods to eliminate large numbers of hypotheses all at once. Maybe hypotheses will be broken into distinct classes. Or maybe they’ll use methods to statistically converge toward the right hypotheses.
Everyone knows there are practical difficulties. Some people expect that someone will almost certainly find a way to mitigate them. You tend to treat people as idiots for going ahead with the version of the discussion that presumes that.
With respect to your objections about undiscoverable short programs, presumably people would look for ways to calibrate the relationship between the description lengths of the theories their procedures do discover (or the lengths of the parts of those theories) and the typical probabilities of observations, perhaps using toy domains where everything is computable and Solomonoff probabilities can be computed exactly. Of course, by diagonalization there are limits to the power of any finite such procedure, but it’s possible to do better than assuming that the only programs are the ones you can find explicitly. Do you mean to make an implicit claim by your mockery which special-cases to the proposition that what makes the post authors deserve to be mocked is that they didn’t point out ways one might reasonably hope to get around this problem, specifically? (Or, alternatively, is the cause of your engaging in mockery something other than a motivation to express propositions (implicit or explicit) that readers might be benefited by believing? For example, is it a carelessly executed side-effect of a drive to hold your own thinking to high standards, and so to avoid looking for ways to make excuses for intellectual constructs you might be biased in favor of?)
It is the case that Luke, for instance of an author of this post, wrote this relatively recently. While there is understanding that bruteforcing is the ideal, I do not see understanding of how important of a requirement that is. We don’t know how long is the shortest physics, and we do not know how long is the shortest god-did-it, but if we can build godlike AI inside our universe then the simplest god is at most same length and unfortunately you can’t know there isn’t a shorter one. Note: I am an atheist, before you jump onto he must be theist if he dislikes that statement. Nonetheless I do not welcome pseudo-scientific justifications of atheism.
edit: also by the way, if we find a way to exploit quantum mechanics to make a halting oracle and do true Solomonoff induction some day, that would make physics of our universe incomputable and this amazing physics itself would not be representable as Turing machine tape, i.e. would be a truth that Solomonoff induction we could do using this undiscovered physics would not be able to even express as a hypothesis. Before you go onto Solomonoff induction you need to understand Halting problem and variations, otherwise you’ll be assuming that someday we’ll overcome Halting problem, which is kind of fine except if we do then we just get ourselves the Halting problem of the second order plus a physics where Solomonoff induction doable with the oracle does not do everything.
You don’t really use Solomonoff induction with Bayes. It directly gives predictions (and in AIXI, evaluates actions). You could use 2^-len prior on hypotheses in Bayesian inference, but this is not Solomonoff induction, it just looks superficially similar but doesn’t share the properties.
Point entirely retracted. I am completely wrong here—I did not see how it works out to identical if you assign certainty of 1 to data. (didn’t occur to me to assign certainty of 1 to anything). Thanks for corrections.
I haven’t read Solomonoff’s original papers, but e.g. Shane Legg’s paper on Solomonoff induction gives a description that amounts to “use Bayes with a 2^-program_length prior”. How do you see the two differing and (if this isn’t obvious) why does it matter?
Well then you have to be using it on each and every program for each bit of uncertain data. (What is uncertain data? The regular definition can model the errors you might have in measurement apparatus as well; it is not clear what is the advantage of putting the uncertainty outside of the codes themselves)
If I am to assume that the hypotheses do not get down to their minimum length (no exhaustive uncomputable search), then there is a very significant penalty from 2^-length prior for actual theories versus just storing the data. The computable generative methods (such as a human making hypothesis) don’t find shortest codes, they may be generating codes that are e.g. the size of the shortest code squared, or even faster growing function.
edit: Also, btw, on the sum over all programs: consider the program that never runs past N bits so the bits after N do not matter. There is 1 program of length N, 2 programs of length N+1 , 4 programs of length N+2, and so on, that are identical to this program when run. Got to be careful counting and not forget you care about prefixes.
Perhaps I’m being dim, but it doesn’t look as if you answered my question: what differences do you see between “Solomonoff induction” and “Bayesian inference with a 2^-length prior”?
It sounds (but I may be misunderstanding) as if you’re contrasting “choose the shortest program consistent with the data” with “do Bayesian inference with all programs as hypothesis space and the 2^-length prior” in which case you’re presumably proposing that “Solomonoff induction” should mean the former. But—I have now found at least one of Solomonoff’s articles—it’s the latter, not the former, that Solomonoff proposed. (I quote from “A preliminary report on a general theory of inductive inference”, revised version. From the preface: “The main point of the report is Equation (5), Section 11”. From section 12, entitled “An interpretation of equation (5)”: “Equation (5) then enables us to use this model to obtain a priori probabilities to be used in computation of a posteriori probabilities using Bayes’ Theorem.”)
Solomonoff induction is uncomputable and any computable method necessarily fails to find shortest-possible programs; this is well known. If this fact is the basis for your objection to using the term “Solomonoff induction” to denote Bayes-with-length-prior then I’m afraid I don’t understand why; could you explain more?
It’s conventional to assume a prefix-free representation of programs when discussing this sort of formalism. I don’t see that this has anything to do with the distinction (if there is one) between Solomonoff and Bayes-with-length-prior.
Firstly, on what is conventional, it is sadly the case that in the LW conversations observed by me it is not typical that either party recollects the magnitude of the importance of having ALL programs as the hypothesis space, or any other detail, even the detail so important as ‘the output’s beginning is matched against observed data’.
I’m sorry I was thinking in the context whereby the hypothesis space is not complete (note: in the different papers on subjects if i recall correctly the hypothesis can either refer to the final string—the prediction—or to the code). Indeed in the case where it is applied to the complete set of hypotheses, it works fine. It still should be noted though that it is Solomonoff probability (algorithmic probability), that can be used with Bayes, resulting in Solomonoff induction.
It was my mistake with regards to what Solomonoff exactly specified, given that there is a multitude of different equivalent up to constant definitions that end up called same name. By Solomonoff induction I was referring to the one described in the article.
OK. (My real purpose in posting this comment is to remark, in case you should care, that it’s not I who have been downvoting you in this thread.)
Thanks for correction in any case. This was genuinely an error on my part. Amusingly it was at +1 while noting that MWI (which someone else brought up) doesn’t begin with the observed data, was much more negatively treated.
You discussed this over here too, with greater hostility:
I’m trying to figure out whether, when you say “someone”, you mean someone upthread or one of the original authors of the post. Because if it’s the post authors, then I get to accuse you of not caring enough about refraining from heaping abuse on writers who don’t deserve it to actually bother to check whether they made the mistake you mocked them for. The post includes discussion of ideals vs. approximations in the recipe metaphor and in the paragraph starting “The actual process above might seem a little underwhelming...”:
Everyone knows there are practical difficulties. Some people expect that someone will almost certainly find a way to mitigate them. You tend to treat people as idiots for going ahead with the version of the discussion that presumes that.
With respect to your objections about undiscoverable short programs, presumably people would look for ways to calibrate the relationship between the description lengths of the theories their procedures do discover (or the lengths of the parts of those theories) and the typical probabilities of observations, perhaps using toy domains where everything is computable and Solomonoff probabilities can be computed exactly. Of course, by diagonalization there are limits to the power of any finite such procedure, but it’s possible to do better than assuming that the only programs are the ones you can find explicitly. Do you mean to make an implicit claim by your mockery which special-cases to the proposition that what makes the post authors deserve to be mocked is that they didn’t point out ways one might reasonably hope to get around this problem, specifically? (Or, alternatively, is the cause of your engaging in mockery something other than a motivation to express propositions (implicit or explicit) that readers might be benefited by believing? For example, is it a carelessly executed side-effect of a drive to hold your own thinking to high standards, and so to avoid looking for ways to make excuses for intellectual constructs you might be biased in favor of?)
It is the case that Luke, for instance of an author of this post, wrote this relatively recently. While there is understanding that bruteforcing is the ideal, I do not see understanding of how important of a requirement that is. We don’t know how long is the shortest physics, and we do not know how long is the shortest god-did-it, but if we can build godlike AI inside our universe then the simplest god is at most same length and unfortunately you can’t know there isn’t a shorter one. Note: I am an atheist, before you jump onto he must be theist if he dislikes that statement. Nonetheless I do not welcome pseudo-scientific justifications of atheism.
edit: also by the way, if we find a way to exploit quantum mechanics to make a halting oracle and do true Solomonoff induction some day, that would make physics of our universe incomputable and this amazing physics itself would not be representable as Turing machine tape, i.e. would be a truth that Solomonoff induction we could do using this undiscovered physics would not be able to even express as a hypothesis. Before you go onto Solomonoff induction you need to understand Halting problem and variations, otherwise you’ll be assuming that someday we’ll overcome Halting problem, which is kind of fine except if we do then we just get ourselves the Halting problem of the second order plus a physics where Solomonoff induction doable with the oracle does not do everything.