I finally decided it’s worth some of my time to try to gain a deeper understanding of decision theory...
Question: Can Bayesians transform decisions under ignorance into decisions under risk by assuming the decision maker can at least assign probabilities to outcomes using some kind of ignorance prior(s)?
Details: “Decision under uncertainty” is used to mean various things, so for clarity’s sake I’ll use “decision under ignorance” to refer to a decision for which the decision maker does not (perhaps “cannot”) assign probabilities to some of the possible outcomes, and I’ll use “decision under risk” to refer to a decision for which the decision maker does assign probabilities to all of the possible outcomes.
There is much debate over which decision procedure to use when facing a decision under ignorance when there is no act that dominates the others. Some proposals include: the leximin rule, the optimism-pessimism rule, the minimax regret rule, the info-gap rule, and the maxipok rule.
However, there is broad agreement that when facing a decision under risk, rational agents maximize expected utility. Because we have a clearer procedure for dealing with decisions under risk than we do for dealing with decisions under ignorance, many decision theorists are tempted to transform decisions under ignorance into decisions under risk by appealing to the principle of insufficient reason: “if you have literally no reason to think that one state is more probable than another, then one should assign equal probability to both states.”
And if you’re a Bayesian decision-maker, you presumably have some method for generating ignorance priors, whether or not that method always conforms to the principle of insufficient reason, and even if you doubt you’ve found the final, best method for assigning ignorance priors.
So if you’re a Bayesian decision-maker, doesn’t that mean that you only ever face decisions under risk, because at the very least you’re assigning ignorance priors to the outcomes for which you’re not sure how to assign probabilities? Or have I misunderstood something?
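For concreteness, here is a minimal sketch of the transformation the question is asking about, with made-up acts, states, and payoffs (none of these numbers come from the post): assign a uniform ignorance prior over the states you can’t otherwise weigh, then choose exactly as you would under risk, by maximizing expected utility.

```python
# A minimal sketch: hypothetical acts, states, and payoffs (all numbers invented
# for illustration). Assign a uniform "ignorance prior" over the states, then
# choose among the acts by maximizing expected utility, as in a decision under risk.

STATES = ["state_1", "state_2", "state_3"]

UTILITY = {
    # utility of each act in each state we couldn't otherwise weigh
    "act_A": {"state_1": 10, "state_2": 0, "state_3": 5},
    "act_B": {"state_1": 4, "state_2": 4, "state_3": 4},
    "act_C": {"state_1": 0, "state_2": 9, "state_3": 2},
}

# principle of insufficient reason: equal probability to every state
ignorance_prior = {s: 1.0 / len(STATES) for s in STATES}

def expected_utility(act: str) -> float:
    return sum(ignorance_prior[s] * UTILITY[act][s] for s in STATES)

for act in UTILITY:
    print(act, round(expected_utility(act), 2))
print("EU-maximizing act under the ignorance prior:", max(UTILITY, key=expected_utility))
```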
You could always choose to manage ignorance by choosing a prior. It’s not obvious whether you should. But as it turns out, we have results like the complete class theorem, which imply that EU maximization with respect to an appropriate prior is the only “Pareto efficient” decision procedure (any other decision can be changed so as to achieve a higher reward in every possible world).
This analysis breaks down in the presence of computational limitations; in that case it’s not clear that a “rational” agent should have even an implicit representation of a distribution over possible worlds (such a distribution may be prohibitively expensive to reason about, much less integrate exactly over), so maybe a rational agent should invoke some decision rule other than EU maximization.
The situation is sort of analogous to defining a social welfare function. One approach is to take a VNM utility function for each individual and then maximize total utility. At face value it’s not obvious if this is the right thing to do—choosing an exchange rate between person A’s preferences and person B’s preferences feels pretty arbitrary and potentially destructive (just like choosing prior odds between possible world A and possible world B). But as it turns out, if you do anything else then you could have been better off by picking some particular exchange rate and using it consistently (again, modulo practical limitations).
as it turns out, we have results like the complete class theorem, which imply that EU maximization with respect to an appropriate prior is the only “Pareto efficient” decision procedure (any other decision can be changed so as to achieve a higher reward in every possible world).
I found several books which give technical coverage of statistical decision theory, complete classes, and admissibility rules (Berger 1985; Robert 2001; Jaynes 2003; Liese & Miescke 2010), but I didn’t find any clear explanation of exactly how the complete class theorem implies that “EU maximization with respect to an appropriate prior is the only ‘Pareto efficient’ decision procedure (any other decision can be changed so as to achieve a higher reward in every possible world).”
Do you know any source which does so, or are you able to explain it? This seems like a potentially significant argument for EUM that runs independently of the standard axiomatic approaches, which have suffered many persuasive attacks.
The formalism of the complete class theorem applies to arbitrary decision procedures; the Bayes decision procedures correspond to EU maximization with respect to an appropriate choice of prior. An inadmissible decision procedure is not Pareto efficient, in the sense that a different decision procedure does at least as well in every possible world and strictly better in some (which feels analogous to making all possible people happier). Does that make sense?
There is a bit of weasel room, in that the complete class theorem assumes that the data is generated by a probabilistic process in each possible world. This doesn’t seem like an issue, because you just absorb the observation into the choice of possible world, but this points to a bigger problem:
If you define “possible worlds” finely enough, such that e.g. each (world, observation) pair is a possible world, then the space of priors is very large (e.g., you could put all of your mass on one (world, observation) pair for each observation) and can be used to justify any decision. For example, if we are in the setting of AIXI, any decision procedure can trivially be described as EU maximization under an appropriate prior: if the decision procedure outputs f(X) on input X, it corresponds to EU maximization against a prior which has the universe end after N steps with probability 2^(-N), and when the universe ends after you see X, you receive an extra reward if your last output was f(X).
So the conclusion of the theorem isn’t so interesting, unless there are few possible worlds. When you argue for EUM, you normally want some stronger statement than saying that any decision procedure corresponds to some prior.
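Not a proof of the theorem, but a toy finite case (the acts and numbers are mine, not from the thread) may make the admissibility-as-Pareto-efficiency reading concrete: an act that maximizes expected utility under no prior over the two possible worlds is dominated, in both worlds, by a randomized mixture of acts that are each Bayes-optimal for some prior.

```python
# A toy illustration (made-up numbers, not from the thread) of the
# admissibility / "Pareto efficiency over possible worlds" point: an act that
# maximizes expected utility under NO prior is dominated, here by a coin-flip
# mixture of the other two acts.

rewards = {            # reward of each act in each of two possible worlds
    "a1": (3.0, 1.0),
    "a2": (1.0, 3.0),
    "a3": (1.8, 1.8),  # "hedging" act: never EU-optimal for any prior over worlds
}

def expected_utility(act, p_w1):
    r1, r2 = rewards[act]
    return p_w1 * r1 + (1 - p_w1) * r2

# a3 is never the EU-maximizing act, for any prior probability of world 1:
print(any(max(rewards, key=lambda a: expected_utility(a, p / 100.0)) == "a3"
          for p in range(101)))            # -> False

# ...and it is dominated: a 50/50 randomization between a1 and a2 yields
# expected reward (2.0, 2.0), strictly better than (1.8, 1.8) in both worlds.
mixture = tuple(0.5 * rewards["a1"][i] + 0.5 * rewards["a2"][i] for i in range(2))
print(mixture, rewards["a3"])
```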
What AlexMennen said. For a Bayesian there’s no difference in principle between ignorance and risk.
One wrinkle is that even Bayesians shouldn’t have prior probabilities for everything, because if you assign a prior probability to something that could indirectly depend on your decision, you might lose out.
A good example is the absent-minded driver problem. While driving home from work, you pass two identical-looking intersections. At the first one you’re supposed to go straight, at the second one you’re supposed to turn. If you do everything correctly, you get utility 4. If you goof and turn at the first intersection, you never arrive at the second one, and get utility 0. If you goof and go straight at the second, you get utility 1. Unfortunately, by the time you get to the second one, you forget whether you’d already been at the first, which means at both intersections you’re uncertain about your location.
If you treat your uncertainty about location as a probability and choose the Bayesian-optimal action, you’ll get demonstrably worse results than if you’d planned your actions in advance or used UDT. The reason, as pointed out by taw and pengvado, is that your probability of arriving at the second intersection depends on your decision to go straight or turn at the first one, so treating it as unchangeable leads to weird errors.
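For reference, a minimal sketch of the planning-stage calculation under the payoffs described above: since the intersections are indistinguishable, a plan is a single probability p of going straight at any intersection, and the planning-optimal plan randomizes, going straight with probability 2/3. The point of the example is that re-deriving the action at an intersection from a fixed probability of “being at the first intersection” can disagree with this plan.

```python
# A minimal sketch of the planning-stage calculation for the absent-minded
# driver payoffs given above (turn at the first intersection = 0, turn at the
# second = 4, go straight at both = 1). The driver can't tell the intersections
# apart, so the policy is one probability p of going straight at any intersection.

def planning_expected_utility(p: float) -> float:
    # turn at the first intersection: utility 0 (term omitted)
    # go straight, then turn at the second: utility 4
    # go straight at both: utility 1
    return p * (1 - p) * 4 + p * p * 1

best_p = max((i / 1000.0 for i in range(1001)), key=planning_expected_utility)
print(best_p, planning_expected_utility(best_p))  # ~0.667, ~1.333 (analytically p = 2/3, EU = 4/3)
```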
One wrinkle is that even Bayesians shouldn’t have prior probabilities for everything, because if you assign a prior probability to something that could indirectly depend on your decision, you might lose out.
… your probability of arriving at the second intersection depends on your decision to go straight or turn at the first one, so treating it as unchangeable leads to weird errors.
“Unchangeable” is a bad word for this, as it might well be thought of as unchangeable, if you won’t insist on knowing what it is. So a Bayesian may “have probabilities for everything”, whatever that means, if it’s understood that those probabilities are not logically transparent and some of the details about them won’t necessarily be available when making any given decision. After you do make a decision that controls certain details of your prior, those details become more readily available for future decisions.
In other words, the problem is not in assigning probabilities to too many things, but in assigning them arbitrarily and thus incorrectly. If the correct assignment of probability is such that the probability depends on your future decisions, you won’t be able to know this probability, so if you’ve “assigned” it in such a way that you do know what it is, you must have assigned a wrong thing. Prior probability is not up for grabs etc.
so treating it as unchangeable leads to weird errors.
The prior probability is unchangeable. It’s just that you make your decision based on the posterior probability taking into account each decision. At least, that’s what you do if you use EDT. I’m not entirely familiar with the other decision theories, but I’m pretty sure they all have prior probabilities for everything.
So if you’re a Bayesian decision-maker, doesn’t that mean that you only ever face decisions under risk, because at the very least you’re assigning ignorance priors to the outcomes for which you’re not sure how to assign probabilities?
Correct. A Bayesian always has a probability distribution over possible states of the world, and so cannot face a decision under ignorance as you define it. Coming up with good priors is hard, but to be a Bayesian, you need a prior.
A Bayesian decision cannot be made without assigning a probability distribution over the outcomes.
As mentioned, you can consider a Bayesian probability distribution over what the correct distribution is; if you have no reason to say which state, if any, is more probable, then the states get the same weight in that meta-distribution. For example: if you know that a coin is unfair, but have no information about which way it is biased, then you should divide the first bet evenly between heads and tails (assuming logarithmic payoffs).
It might make sense to picture the probability distribution over the coin’s bias as a graph: the X axis, from 0 to 1, is the chance of each flip coming up heads, and the Y axis is how likely the coin is to have that particular bias; because of our prior information, there is a removable discontinuity at x = 1/2. Initially the graph is flat, but after the first flip it changes: if it came up tails, the odds of a two-headed coin are now 0, the odds of a coin with heads probability 0.9999 are infinitesimal, and the odds of a tail-weighted coin are significantly greater. (Having no prior information on how weighted the coin is, you could assume that all weightings except fair are equally likely.) After the second flip, however, you have information about what the bias of the coin was, but no information about whether the bias is time-variable, such that, say, the coin always comes up heads on prime-numbered flips and tails on composite-numbered flips.
If you consider it just as likely that the coin is rigged to follow a predetermined sequence as that the result of each flip is randomly determined, then you have a problem: there are some specific gaps in a prior that no amount of information can update.
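A minimal sketch of the bias distribution described above, discretized (the grid size and the particular values printed are arbitrary choices of mine): a flat prior over heads-probabilities with the fair value 1/2 excluded, updated on one observed tail.

```python
# A minimal sketch of the "meta-distribution over the coin's bias" described
# above: a discretized flat prior over the heads-probability, with the fair
# value 1/2 excluded (we're told the coin is unfair), updated on one observed tail.

N = 100
biases = [i / N for i in range(N + 1) if i != N // 2]   # exclude the fair coin at 0.5
prior = {b: 1.0 / len(biases) for b in biases}          # flat over all unfair biases

def update(dist, heads: bool):
    # Bayes' rule: posterior is proportional to prior times likelihood of the flip
    posterior = {b: p * (b if heads else 1 - b) for b, p in dist.items()}
    z = sum(posterior.values())
    return {b: p / z for b, p in posterior.items()}

after_tail = update(prior, heads=False)
print(after_tail[1.0])    # two-headed coin is now ruled out: 0.0
print(after_tail[0.99])   # nearly-two-headed coin: tiny but nonzero
print(after_tail[0.1])    # tail-weighted coin: larger than its prior mass
```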
This reminds me of a recent tangent on Kelly betting. Apparently it’s claimed that the unusualness of this optimum betting strategy shows that you should treat risk and ignorance differently—but of course the difference between the two situations is entirely accounted for by two different conditional probability distributions. So you can sort of think of situations (that is, the probability distribution describing possible outcomes) as “risk-like” or “ignorance-like.”
If you’re talking about what I think you’re talking about, then by “risk”, you mean “frequentist probability distribution over outcomes”, and by “ignorance”, you mean “Bayesian probability distribution over what the correct frequentist probability distribution over outcomes is”, which is not the way Luke was defining the terms.
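For what it’s worth, here is a minimal sketch (my own illustration, not anyone’s position in the thread) of that accounting, using the log-payoff coin bet mentioned earlier: for a single even-money bet, expected log wealth depends only on the predictive probability of heads, so a bettor who knows the coin is fair and a bettor who only knows it is unfair in an unknown direction (bias averaging to 1/2) choose the same even split.

```python
# A minimal sketch (my own illustration): for a single even-money bet with
# logarithmic utility, only the predictive probability of heads matters, so a
# "risk" bettor who knows p = 0.5 and an "ignorance" bettor whose distribution
# over the unknown bias averages to 0.5 choose the same split of the stake.

import math

def expected_log_wealth(split_on_heads, p_heads):
    # Stake 1 unit: put `split_on_heads` on heads and the rest on tails;
    # an even-money payoff doubles whichever side wins.
    return (p_heads * math.log(2 * split_on_heads)
            + (1 - p_heads) * math.log(2 * (1 - split_on_heads)))

def expected_log_wealth_under_ignorance(split_on_heads, bias_distribution):
    return sum(prob * expected_log_wealth(split_on_heads, bias)
               for bias, prob in bias_distribution.items())

grid = [i / 100.0 for i in range(1, 100)]
risk = max(grid, key=lambda f: expected_log_wealth(f, 0.5))
ignorance = max(grid, key=lambda f: expected_log_wealth_under_ignorance(
    f, {0.1: 0.5, 0.9: 0.5}))   # unfair coin, direction of bias unknown
print(risk, ignorance)          # both 0.5: divide the first bet evenly
```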