This doesn’t address what I said at all. We don’t speak of “the” universal prior because there’s a specific UTM it’s defined with respect to, we speak of “the” universal prior because we don’t much care about the distinction between different universal priors! The above article is still about doing Bayesian updating starting with a universal prior. That which particular universal prior you start from doesn’t matter much is not new information and in no way supports your claim that any “reasonable” prior—whatever that might mean—will also have this same property.
I think when he says “the choice of the reference machine doesn’t matter too much” and “the choice of reference machine really doesn’t matter except for very small data sets” he literally means those things. I agree that my position on this is not new.
Sorry, how does “literally” differ from what I stated? And you seem to be stating something very different from him. He is just stating that the UTM used to define the universal prior is irrelevant. You are claiming that any “reasonable” prior, for some unspecified but expansive-sounding notion of “reasonable”, has the same universal property as a universal prior.
That seems like quite a tangle, and alas, I am not terribly interested in it. But:
The term was “reference machine”. No implication that it is a UTM is intended—it could be a CA—or any other universal computer. The reference machine totally defines all aspects of the prior. There are not really “universal reference machines” which are different from other “reference machines”—or if there are “universal” just refers to universal computation. A universal machine can define literally any distribution of priors you can possibly imagine. So: the distinction you are trying to make doesn’t seem to make much sense.
Convergence on accurate beliefs has precious little to do with the prior—it is a property of the updating scheme. The original priors matter little after a short while—provided they are not zero, one—or otherwise set so they prevent updating from working at all.
Thinking of belief convergence as having much to do with your priors is a wrong thought.
> The term was “reference machine”. No implication that it is a UTM is intended—it could be a CA—or any other universal computer. The reference machine totally defines all aspects of the prior. There are not really “universal reference machines” which are different from other “reference machines”—or if there are “universal” just refers to universal computation. A universal machine can define literally any distribution of priors you can possibly imagine. So: the distinction you are trying to make doesn’t seem to make much sense.
Sorry, what? Of course it can be any sort of universal computer; why would we care whether it’s a Turing machine or some other sort? Your statement that taking a universal computer and generating the corresponding universal prior will get you “literally any distribution of priors you can imagine” is just false, especially as it will only get you uncomputable ones! Generating a universal prior will only get you universal priors. Perhaps you were thinking of some other way of generating a prior from a universal computer? Because that isn’t what’s being talked about.
> Convergence on accurate beliefs has precious little to do with the prior—it is a property of the updating scheme. The original priors matter little after a short while—provided they are not zero, one—or otherwise set so they prevent updating from working at all.
>
> Thinking of belief convergence as having much to do with your priors is a wrong thought.
You have still done nothing to demonstrate this. The potential for dependence on priors has been demonstrated elsewhere (anti-inductive priors, etc). The “updating scheme” is Bayes’ Rule. (This might not suffice in the continuous-time case, but you explicitly invoked the discrete-time case above!) But to determine all those probabilities, you need to look at the prior. Seriously, show me (or just point me to) some math. If you refuse to say what makes a prior “reasonable”, what are you actually claiming? That the set of priors with this property is large in some appropriate sense? Please name what sense. Why should we not just use some equivalent of maxent, if what you say is true?
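Both sides of this exchange can be seen in a toy finite case. The sketch below is only an illustration under stated assumptions: a hypothesis class of five coin biases, a true bias of 0.7, and three made-up priors (`uniform`, `skewed`, `dogmatic`), none of which come from the thread. It applies Bayes' rule directly and says nothing about universal priors or about how large the class of "reasonable" priors is.

```python
import math
import random


def posterior(prior, data):
    """Bayes' rule over a finite set of coin biases.

    prior: dict mapping bias -> prior probability
    data:  sequence of 0/1 observations
    """
    # Work in log space to avoid underflow on long sequences.
    log_post = {}
    for theta, p in prior.items():
        if p == 0.0:
            continue  # zero prior mass can never be recovered by updating
        ll = sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)
        log_post[theta] = math.log(p) + ll
    top = max(log_post.values())
    weights = {t: math.exp(v - top) for t, v in log_post.items()}
    total = sum(weights.values())
    return {t: weights.get(t, 0.0) / total for t in prior}


# Assumed setup for illustration only.
rng = random.Random(0)
true_bias = 0.7
data = [1 if rng.random() < true_bias else 0 for _ in range(2000)]

biases = [0.1, 0.3, 0.5, 0.7, 0.9]
uniform = {t: 0.2 for t in biases}
skewed = {0.1: 0.96, 0.3: 0.01, 0.5: 0.01, 0.7: 0.01, 0.9: 0.01}
dogmatic = {0.1: 0.5, 0.3: 0.5, 0.5: 0.0, 0.7: 0.0, 0.9: 0.0}

for name, prior in [("uniform", uniform), ("skewed", skewed), ("dogmatic", dogmatic)]:
    post = posterior(prior, data)
    best = max(post, key=post.get)
    print(name, best, round(post[best], 4))
```

The two priors that give the true bias nonzero mass end up agreeing, while the prior that assigns it zero mass never can. Whether a given prior behaves like the first two or like the third can only be read off the prior itself.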
“Of course it can be any sort of universal computer; why would we care whether it’s a Turing machine or some other sort?”
Well, different reference machines produce different prior distributions—so the distribution used matters initially, when the machine is new to the world.
“Your statement that taking a universal computer and generating the corresponding universal prior will get you ‘literally any distribution of priors you can imagine’ is just false, especially as it will only get you uncomputable ones!”
“Any distribution you can compute”, then—if you prefer to think that you can imagine the uncomputable.
“You have still done nothing to demonstrate this.”
Actually, I think I give up trying to explain. From my perspective you seem to have some kind of tangle around the word “universal”. “Universal” could usefully refer to “universal computation” or to a prior that covers “every hypothesis in the universe”.
There is also the “universal prior”—but I don’t think “universal” there has quite the same significance that you seem to think it does. There seems to be repeated miscommunication going on in this area.
It seems non-trivial to describe the class of priors that leads to “fairly rapid” belief convergence in an intelligent machine. Suffice to say, I think that class is large—and that the details of priors are relatively insignificant—provided there is not too much “faith”—or “near faith”. Part of the reason for that is that priors usually get rapidly overwritten by data. That data establishes its own subsequent prior distributions for all the sources you encounter—and for most of the ones that you don’t. If you don’t agree, fine—I won’t bang on about it further in an attempt to convince you.
Firstly, please use Markdown quotes for ease of reading? :-/
> Well, different reference machines produce different prior distributions—so the distribution used matters initially, when the machine is new to the world.
Indeed, but I don’t think that’s really the property under discussion.
> “Any distribution you can compute”, then—if you prefer to think that you can imagine the uncomputable.
...huh? Maybe you are misunderstanding the procedure in question here. We are not taking arbitrary computations that output distributions and using those distributions. That would get you arbitrary computable distributions. Rather, we are taking arbitrary universal computers/UTMs/Turing-complete programming languages/whatever you want to call them, and then generating a distribution as “the probability of x is the sum of 2^-length over all programs that output something beginning with x” (possibly normalized). I.e. we are taking a reference machine and generating the corresponding universal prior.
Not only will this not get you “any distribution you can compute”, it won’t get you any distributions you can compute at all. The resulting distribution is always uncomputable. (And hence, in particular, not practical, and presumably not “reasonable”, whatever that may mean.)
Am I mistaken in asserting that this is what was under discussion?
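For reference, the construction described in this comment is usually written as below. This is a sketch in standard notation, not something stated in the thread: the symbols $U$, $V$, $M_U$, and $c_{UV}$ are introduced here for illustration, and the sum is conventionally restricted to minimal programs so that the total mass stays at most 1.

```latex
% Universal prior induced by a universal reference machine U:
% sum over (minimal) programs p whose output begins with x.
\[
  M_U(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}
\]

% Switching reference machines costs at most a constant factor, where
% c_{UV} is the length of a program that makes U simulate V:
\[
  M_U(x) \;\ge\; 2^{-c_{UV}} \, M_V(x) \qquad \text{for all } x .
\]
```

The second inequality is the usual sense in which the choice of reference machine "doesn't matter too much": the constant does not depend on $x$, so its effect is bounded as the data grows, although it can be large for short strings.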
> It seems non-trivial to describe the class of priors that leads to “fairly rapid” belief convergence in an intelligent machine. Suffice to say, I think that class is large—and that the details of priors are relatively insignificant—provided there is not too much “faith”—or “near faith”. Part of the reason for that is that priors usually get rapidly overwritten by data. That data establishes its own subsequent prior distributions for all the sources you encounter—and for most of the ones that you don’t. If you don’t agree, fine—I won’t bang on about it further in an attempt to convince you.
You don’t have to attempt to convince me, but do note that despite asserting it repeatedly you have, in fact, done zero to establish the truth of this assertion / validity of this intuition, which I have good reason to believe to be unlikely, as I described earlier.
FWIW, what I meant was that, by altering the reference machine, p(x) for all bitstrings x less than a zillion bits long can be made into any set of probabilities you like, provided they don’t add up to more than 1, of course.
The reference machine defines the resulting probability distribution completely.
AH! So you are making a comment on the use of universal priors to approximate arbitrary finite priors (and hence presumably vice versa). That is interesting, though I’m not sure what it has to do with eventual convergence. You should have actually stated that at some point!
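The finite claim above can be illustrated with a toy lookup-table machine. This is a minimal sketch under strong assumptions: the machine is not universal, the function names `build_toy_machine` and `induced_prior` are hypothetical, and the target probabilities are made up. It uses Shannon code lengths and a canonical prefix code, so the induced probability of each string is within a factor of 2 of its target, and exact when the targets are powers of 1/2. A genuinely universal reference machine would also have to reserve some program prefix for a general-purpose interpreter, which this sketch omits.

```python
import math
from fractions import Fraction


def build_toy_machine(target):
    """Assign prefix-free binary programs so that 2**-len(program) tracks target[x].

    target: dict mapping output string -> desired probability (sum <= 1).
    Returns a dict mapping program (bit string) -> output string.
    """
    # Shannon code lengths: the smallest n >= 1 with 2**-n <= target[x].
    lengths = {x: max(1, math.ceil(-math.log2(q))) for x, q in target.items()}
    machine = {}
    code = Fraction(0)  # next free point in [0, 1), as in a canonical prefix code
    for x in sorted(lengths, key=lambda s: (lengths[s], s)):
        n = lengths[x]
        # The first n bits of the binary expansion of `code` form the program.
        program = format(int(code * 2**n), "b").zfill(n)
        machine[program] = x
        code += Fraction(1, 2**n)
    return machine


def induced_prior(machine):
    """Probability of each output: sum of 2**-len(p) over programs p producing it."""
    prior = {}
    for program, x in machine.items():
        prior[x] = prior.get(x, Fraction(0)) + Fraction(1, 2 ** len(program))
    return prior


# Made-up target distribution over four 4-bit strings.
target = {"0000": 0.5, "1111": 0.25, "0101": 0.125, "1010": 0.125}
machine = build_toy_machine(target)
print(machine)
print(induced_prior(machine))
```

This only shows that the probabilities assigned to a finite set of strings can be chosen freely by choosing the machine. It does not bear on what happens to those probabilities under subsequent updating.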