The prior is by definition whatever it was rational to believe before the acquisition of new evidence (assuming a perfect Bayesian, anyway).
Nope, this isn’t part of the definition of the prior, and I don’t see how it could be. The prior is whatever you actually believe before any evidence comes in.
If you have a procedure to determine which priors are “rational” before looking at the evidence, please share it with us. Some people here believe religiously in maxent, others swear by the universal prior, I personally rather like reference priors, but the Bayesian apparatus doesn’t really give us a means of determining the “best” among those. I wrote about these topics here before. If you want the one-word summary, the area is a mess.
I want to believe that there is some optimal general prior, but it seems much more likely that we do not live in so convenient a world.
But if you can evaluate how good a prior is, then there has to be an optimal one (or several). You have to have something as your prior, and so whichever one is the best out of those you can choose is the one you should have. As for how certain you are that it’s the best, it’s (to some extent) turtles all the way down.
Instead of using “optimal general prior”, I should have said that I was pessimistic about the existence of a standard for evaluating priors (or, more properly, prior probability distributions) that is optimal in all circumstances, if that’s any clearer.
Having thought about the problem some more, though, I think my pessimism may have been premature.
A prior probability distribution is nothing more than a weighted set of hypotheses. A perfect Bayesian would consider every possible hypothesis, which is impossible unless hypotheses are countable, and they aren’t; the ideal of Bayesian reasoning as I understand it is thus unattainable, but this doesn’t mean that there are no benefits to be found in moving toward that ideal.
So, perfect Bayesian or not, we have some set of hypotheses which need to be located before we can consider them and assign them a probabilistic weight. Before we acquire any rational evidence at all, there is necessarily only one factor that we can use to distinguish between hypotheses: how hard they are to locate. If it is also true that hypotheses which are easier to locate make more predictions and that hypotheses which make more predictions are more useful (and while I have not seen proofs of these propositions I’m inclined to suspect that they exist), then we are perfectly justified in assigning a probability to a hypothesis based on its locate-ability.
This reduces the problem of prior probability evaluation to the problem of locate-ability evaluation, to which it seems maxent and its fellows are proposed answers. It’s again possible there is no objectively best way to evaluate locate-ability, but I don’t yet see a reason for this to be so.
Again, if I’ve mis-thought or failed to justify a step in my reasoning, please call me on it.
If it is also true that hypotheses which are easier to locate make more predictions
This doesn’t sound right to me. Imagine you’re tossing a coin repeatedly. Hypothesis 1 says the coin is fair. Hypothesis 2 says the coin repeats the sequence HTTTHHTHTHTTTT over and over in a loop. The second hypothesis is harder to locate, but makes a stronger prediction.
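To make “stronger prediction” concrete, here’s a quick sketch (Python is my choice; the thread doesn’t specify a language) of how much probability each hypothesis assigns to an observed run of flips:

```python
# Likelihood of an observed flip sequence under each hypothesis.
# Hypothesis 1: fair coin -> every length-n sequence gets probability 2^-n.
# Hypothesis 2: the deterministic loop HTTTHHTHTHTTTT -> probability 1 if the
# observation matches the loop, 0 otherwise.

PATTERN = "HTTTHHTHTHTTTT"

def likelihood_fair(seq: str) -> float:
    return 0.5 ** len(seq)

def likelihood_loop(seq: str) -> float:
    expected = (PATTERN * (len(seq) // len(PATTERN) + 1))[:len(seq)]
    return 1.0 if seq == expected else 0.0

obs = PATTERN * 2                 # 28 flips that happen to follow the loop
print(likelihood_fair(obs))      # 2^-28: the fair coin spreads its mass thin
print(likelihood_loop(obs))      # 1.0: much stronger prediction, when right
```

The loop hypothesis concentrates all of its probability on one continuation, which is exactly what makes its prediction stronger.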
The proper formalization for your concept of locate-ability is the Solomonoff prior. Unfortunately we can’t do inference based on it because it’s uncomputable.
Maxent and friends aren’t motivated by a desire to formalize locate-ability. Maxent is the “most uniform” distribution on a space of hypotheses; the “Jeffreys rule” is a means of constructing priors that are invariant under reparameterizations of the space of hypotheses; “matching priors” give you frequentist coverage guarantees, and so on.
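For the maxent point, a tiny illustration (my addition, not from the thread): on a finite outcome space with no constraints, the uniform distribution is the one that maximizes Shannon entropy.

```python
import math

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

uniform = [1 / 6] * 6                       # maxent on a six-outcome space
skewed  = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]    # any other distribution loses entropy

print(entropy_bits(uniform))  # log2(6), about 2.585 bits
print(entropy_bits(skewed))   # about 2.161 bits, strictly less
```

This is the sense in which maxent picks the “most uniform” distribution consistent with whatever constraints you impose.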
Please don’t take my words for gospel just because I sound knowledgeable! At this point I recommend you actually study the math and come to your own conclusions. Maybe contact user Cyan; he’s a professional statistician who inspired me to learn this stuff. IMO, discussing Bayesianism as some kind of philosophical system without digging into the math is counterproductive, though people around here do that a lot.
I’m in the process of digging into the math, so hopefully at some point soon I’ll be able to back up my suspicions in a more rigorous way.
This doesn’t sound right to me. Imagine you’re tossing a coin repeatedly. Hypothesis 1 says the coin is fair. Hypothesis 2 says the coin repeats the sequence HTTTHHTHTHTTTT over and over in a loop. The second hypothesis is harder to locate, but makes a stronger prediction.
I was talking about the number of predictions, not their strength. So Hypothesis 1 predicts any sequence of coin-flips that converges on 50%, and Hypothesis 2 predicts only sequences that repeat HTTTHHTHTHTTTT. Hypothesis 1 explains many more possible worlds than Hypothesis 2, and so without evidence as to which world we inhabit, Hypothesis 1 is much more likely.
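The counting here can be made explicit (a sketch in Python, using a single 14-flip window so the enumeration stays small): the fair coin assigns nonzero probability to every sequence, the loop hypothesis to exactly one.

```python
from itertools import product

PATTERN = "HTTTHHTHTHTTTT"  # the 14-flip loop from the counterexample
n = len(PATTERN)

# Worlds (length-n flip sequences) each hypothesis assigns nonzero probability to.
fair_worlds = [''.join(s) for s in product("HT", repeat=n)]  # all 2^14 of them
loop_worlds = [s for s in fair_worlds if s == PATTERN]       # just the one

print(len(fair_worlds))  # 16384
print(len(loop_worlds))  # 1
```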
Since I’ve already conceded that being a Perfect Bayesian is impossible, I’m not surprised to hear that measuring locate-ability is likewise impossible (especially because the one reduces to the other). It just means that we should determine prior probabilities by approximating the Solomonoff prior as best we can.
Thanks for taking the time to comment, by the way.
Then let’s try this. Hypothesis 1 says the sequence will consist of only H repeated forever. Hypothesis 2 says the sequence will be either HTTTHHTHTHTTTT repeated forever, or TTHTHTTTHTHHHHH repeated forever. The second one is harder to locate, but describes two possible worlds rather than one.
Maybe your idea can be fixed somehow, but I see no way yet. Keep digging.
I’ve just reread Eliezer’s post on Occam’s Razor and it seems to have clarified my thinking a little.
I originally said:
If it is also true that hypotheses which are easier to locate make more predictions… then we are perfectly justified in assigning a probability to a hypothesis based on its locate-ability.
But I would now say:
If it is also true that hypotheses with a shorter minimum message length (MML) make more predictions relative to that minimum message length than do hypotheses with longer MMLs… then we are perfectly justified in assigning a probability to a hypothesis based on MML.
This solves the problem your counterexample presents: Hypothesis 1 describes only one possible world, but Hypothesis 2 requires, say, ~30 more bits of information (for those particular strings of results, plus a disjunction) to describe only two possible worlds, making it 2^30 / 2 times less likely.
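The arithmetic behind that last step can be sketched like this (assuming, as I read the argument, that a hypothesis is penalized by 2^-MML and credited once per world it describes — my formalization, not anything stated above):

```python
def hyp_weight(mml_bits: float, n_worlds: int) -> float:
    # Penalize description length exponentially, credit each world explained.
    return 2.0 ** -mml_bits * n_worlds

# Hypothetical numbers matching the comment: Hypothesis 2 costs ~30 more bits
# (taking Hypothesis 1's MML as the zero point) and describes 2 worlds, not 1.
ratio = hyp_weight(0, 1) / hyp_weight(30, 2)
print(ratio)  # 2^30 / 2 = 2^29
```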
Then let’s try this. Hypothesis 1 says the sequence will consist of only H repeated forever. Hypothesis 2 says the sequence will be HTTTHHTHTHTTTTX repeated forever, where the X can take different values on each repetition. The second hypothesis is harder to locate but describes an infinite number of possible worlds :-)
The problem with this counterexample is that you can’t actually repeat something forever.
Even taking the case where we repeat each sequence 1000 times, which seems like it should be similar, you’ll end up with 1000 coin flips and 15000 coin flips for Hypothesis 1 and Hypothesis 2, respectively. So the odds of being in a world where Hypothesis 1 is true are 1 in 2^1000, but the odds of being in a world where Hypothesis 2 is true are 1 in 2^15000.
It’s an apples to balloons comparison, basically.
(I spent about twenty minutes staring at an empty comment box and sweating blood before I figured this out, for the record.)
I think this is still wrong. Take the finite case where both hypotheses are used to explain sequences of a billion throws. Then the first hypothesis describes one world, and the second one describes an exponentially huge number of worlds. You seem to think that the length of the sequence should depend on the length of the hypothesis, and I don’t understand why.
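The finite version of this count is easy to check at a small scale (Python sketch; I’m using 150 flips as a stand-in for a billion, and assuming the 15-flip block with one free position per repetition from the counterexample above):

```python
# Hypothesis 1: all H. Hypothesis 2: a 15-flip block with one free position,
# repeated; each repetition's free flip can independently go either way.

block_len = 15
n = 150                      # sequence length (must be a multiple of block_len)
reps = n // block_len

worlds_h1 = 1                # exactly one all-H sequence of length n
worlds_h2 = 2 ** reps        # one free binary choice per repetition

print(worlds_h1, worlds_h2)  # 1 vs 1024: exponential in the number of repetitions
```

The number of worlds Hypothesis 2 describes grows with the length of the sequence, independently of the length of the hypothesis itself.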
It’s again possible there is no objectively best way
I’m not sure I’m willing to grant that it’s impossible in principle. Presumably, you need to find some way of choosing your priors, and some time later you can check your calibration, and you can then evaluate the effectiveness of one method versus another.
If there’s any way to determine whether you’ve won bets in a series, then it’s possible to rank methods for choosing the correct bet. And that general principle can continue all the way down. And if there isn’t any way of determining whether you’ve won, then I’d wonder if you’re talking about anything at all (weird thought experiments aside).
Thanks for the links (and your post!), I now have a much clearer idea of the depths of my ignorance on this topic.
If at first you don’t succeed, try, try again!
That is an awesome counter-example, thank you. I think I may wait to ponder this further until I have a better grasp of the math involved.