Rival formalizations of a decision problem
Decision theory is not one of my strengths, and I have a question about it.
Is there a consensus view on how to deal with the problem of “rival formalizations”? Peterson (2009) illustrates the problem like this:
Imagine that you are a paparazzi photographer and that rumour has it that actress Julia Roberts will show up in either New York (NY), Los Angeles (LA) or Paris (P). Nothing is known about the probability of these states of the world. You have to decide if you should stay in America or catch a plane to Paris. If you stay and [she] shows up in Paris you get $0; otherwise you get your photos, which you will be able to sell for $10,000. If you catch a plane to Paris and Julia Roberts shows up in Paris your net gain after having paid for the ticket is $5,000, and if she shows up in America you for some reason, never mind why, get $6,000. Your initial representation of the decision problem is visualized in Table 2.13.
Table 2.13
|             | P   | LA   | NY   |
|-------------|-----|------|------|
| Stay        | $0  | $10k | $10k |
| Go to Paris | $5k | $6k  | $6k  |
Since nothing is known about the probabilities of the states in Table 2.13, you decide it makes sense to regard them as equally probable [see Table 2.14].
Table 2.14
|             | P (1/3) | LA (1/3) | NY (1/3) |
|-------------|---------|----------|----------|
| Stay        | $0      | $10k     | $10k     |
| Go to Paris | $5k     | $6k      | $6k      |
The two rightmost columns are exactly parallel. Therefore, they can be merged into a single (disjunctive) column by adding the probabilities of the two rightmost columns together (Table 2.15).
Table 2.15
|             | P (1/3) | LA or NY (2/3) |
|-------------|---------|----------------|
| Stay        | $0      | $10k           |
| Go to Paris | $5k     | $6k            |
However, now suppose that you instead start with Table 2.13 and first merge the two repetitious states into a single state. You would then obtain the decision matrix in Table 2.16.
Table 2.16
|             | P   | LA or NY |
|-------------|-----|----------|
| Stay        | $0  | $10k     |
| Go to Paris | $5k | $6k      |
Now, since you know nothing about the probabilities of the two states, you decide to regard them as equally probable… This yields the formal representation in Table 2.17, which is clearly different from the one suggested above in Table 2.15.
Table 2.17
|             | P (1/2) | LA or NY (1/2) |
|-------------|---------|----------------|
| Stay        | $0      | $10k           |
| Go to Paris | $5k     | $6k            |
Which formalisation is best, 2.15 or 2.17? It seems question begging to claim that one of them must be better than the other — so perhaps they are equally reasonable? If they are, we have an example of rival formalisations.
Note that the principle of maximising expected value recommends different acts in the two matrices. According to Table 2.15 you should stay, but 2.17 suggests you should go to Paris.
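A quick sanity check of this, as a minimal Python sketch (my own illustration, not Peterson's; payoffs in thousands of dollars, probabilities taken from Tables 2.15 and 2.17):

```python
# Expected values (in $1,000s) under the two rival probability assignments.

def expected_value(probs, payoffs):
    """Probability-weighted sum of payoffs."""
    return sum(p * x for p, x in zip(probs, payoffs))

stay = [0, 10]        # payoffs in states [Paris, LA-or-NY]
go_to_paris = [5, 6]

for label, probs in [("Table 2.15", (1/3, 2/3)), ("Table 2.17", (1/2, 1/2))]:
    ev_stay = expected_value(probs, stay)
    ev_go = expected_value(probs, go_to_paris)
    best = "Stay" if ev_stay > ev_go else "Go to Paris"
    print(f"{label}: E(Stay) = {ev_stay:.2f}, E(Go) = {ev_go:.2f} -> {best}")

# Table 2.15: E(Stay) ≈ 6.67 > E(Go) ≈ 5.67 -> Stay
# Table 2.17: E(Stay) = 5.00 < E(Go) = 5.50 -> Go to Paris
```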
Does anyone know how to solve this problem? If one is not convinced by the illustration above, Peterson (2009) offers a proof that rival representations are possible on pages 33–35.
DT is not my strength either, but it seems like Peterson is just doing a simple sleight-of-hand trick. The trickery is concealed in this line:

"Now, since you know nothing about the probabilities of the two states, you decide to regard them as equally probable…"
This conflicts with the earlier claim that our priors for each location are 1⁄3. Changing the way the table looks does not mean that we are allowed to change that prior information. P(P) is still 1⁄3, and P(LA or NY) is still 2⁄3. Calling these “two states” conceals the fact that you have manipulated the prior information, which is what’s creating the “paradox.”
This. Either we know nothing about each of the three states, or we know nothing about either of the two states, not both.
The trick is that until you add in the prior, you don’t actually have a decision theory problem, only part of one; making the states equally probable is adding information, and shuffling states around and then making the states equally probable is adding different information.
A fully-specified decision theory problem is one that could be written as a function which takes as input a strategy and a random number generator, and outputs a utility score. If you have to add any information—priors, structure, expected opponent-strategy—then you have an underspecified problem, which puts you back in the realm of science.
I think this just repeats what Peterson is saying. The difficulty is that there are multiple “reasonable” ways to specify (formalize) the decision problem. So, whether the “rival formalizations” problem is categorized into the domain of science or decision theory, do you know a solution to the problem?
The trick is that when he condenses LA and NY into an “America” option, he is actually throwing away information, thus changing the problem. If he didn’t throw away that information, he couldn’t apply the indifference principle to Paris vs. LA/NY, because knowing that LA and NY are two cities while Paris is one breaks the symmetry that the indifference principle relies on.
Now, it’s entirely reasonable to get that same effect by saying something like “well, Julia Roberts really likes Paris, so her chance of showing up there is twice that of the other cities.” This sort of thing cannot practically be represented by the indifference principle, so symmetry gets replaced with arbitrariness. But the arbitrariness is about which problems are possible, not about the solution to an individual problem.
Suppose I subdivide Paris into two districts?
And, presumably, assign one district each to LA and NY? I bet you can guess the answer.
The trouble with these spatial examples is that everyone has all these pesky intuitions lying around. “Space is continuous, of course!” we think, and “cities are made of parts!” But the formal statement of the problem, if the principle of indifference is to be useful, must generally be quite low-information—if the symmetry between the cities is thoroughly broken by us having tons of knowledge about the cities, the example is false as stated.
In order to get in the low-information mindset, it helps to replace meaningful (to us) labels with meaningless ones. In the first “formalization,” all we know is that Julia Roberts could be in one of 3 named cities. Avoiding labels, all we know is that agent 1 could have mutually exclusive and exhaustive properties A, B and C. As soon as the problem is stated this way it becomes clearer that you can’t just condense properties B and C together without changing the problem.
I never said that?
Why does “the formal statement of the problem” matter? Reality doesn’t depend on how the problem is phrased.
You seem to be trying to find an answer that would satisfy a hypothetical teacher, not the answer that you would use if you had something to protect.
Suppose I instead called the options A1, B1 and B2. Renaming the options shouldn’t change anything after all.
Why are you surprised that incompatible priors (called “rival formalizations” by Peterson) produce incompatible decisions?
The “consensus” view (also the only one that seems to make sense) is likely that the more accurate map (in this case—literally) of the territory (e.g. three equiprobable cities instead of two equiprobable continents) produces better decisions.
It’s another form of the Bayesian priors problem, which I believe is fundamentally unsolvable. A Solomonoff prior gets you to within a constant factor, given sufficient computational resources, but that constant factor is allowed to be huge. You can drive the problem out from specific domains by gathering enough evidence about them to overwhelm the priors, but with a fixed pool of evidence, you really do have to just guess.
Regarding a set of states as equally probable is significant not for scientific or decision-theoretic reasons, but because it’s a Schelling point in debates over priors. Unfortunately, as you have noticed, there can be arbitrarily many Schelling points, and the number of points increases as you add more vagaries to the problem. There are special cases in which you can derive an ignorance prior from symmetry—such as if the labels on the locations were known to have been shuffled in a uniformly random way—but the labels in this case are not symmetrical.
This problem is similar to the bead jar guess problem. Essentially the problem is where priors come from and it doesn’t have a general solution within the context of Bayesianism. Bayes can tell you how to update your priors, but not what your initial priors should be.
The best thing to do in this problem, when you’re not sure what priors you should assign, is to work backwards and figure out what priors you need to arrive at one solution or the other. In this case:
Let P = Pr(Julia Roberts goes to Paris). Then E(Stay) = 10(1-P) = 10 − 10P and E(Go) = 5P + 6(1-P) = 6 − P (in thousands of dollars). So E(Stay) > E(Go) if 10 − 10P > 6 − P, i.e. 4 > 9P, i.e. P < 4⁄9.
Now, instead of trying to decide “what does the Holy Doctrine of Indifference direct us to do in this situation” we can think about the real question: is the probability that Julia Roberts goes to Paris less than 4/9?
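If it helps, here is a minimal sketch of the same calculation with exact fractions (again my own illustration; payoffs in thousands of dollars):

```python
from fractions import Fraction

def ev_stay(p):
    """Expected value of staying (in $1,000s), where p = Pr(Paris)."""
    return 10 * (1 - p)

def ev_go(p):
    """Expected value of flying to Paris (in $1,000s)."""
    return 5 * p + 6 * (1 - p)

# Break-even point: 10 - 10p = 6 - p  =>  p = 4/9
threshold = Fraction(4, 9)
assert ev_stay(threshold) == ev_go(threshold)

for p in (Fraction(1, 3), Fraction(1, 2)):   # the priors from Tables 2.15 and 2.17
    print(p, "->", "Stay" if p < threshold else "Go to Paris")
# 1/3 -> Stay;  1/2 -> Go to Paris
```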
The question being asked here is really “what priors should I assign?”. I don’t think I have the answer, but allow me to restate the problem:
“The subject is known to be headed for one of three places, A, B and C. What is the probability they turn up in each place?” To which the answer is 1⁄3 each by indifference.
Now why did it look ok to us to merge B and C into one option, (B or C)? Because (in the original problem) B and C were cities located in the same country, the U.S., and that prior geographical information had been incorporated into the problem. When we condition on the knowledge that A is in one country (France) and B and C are in another (the U.S.) the problem is, well, no longer symmetrical. And I confess I’m now actually unsure how or if indifference or maximum entropy can be applied to this now asymmetric problem.
You have to use the information about the asymmetry, which in this case involves an actress and geopolitical boundaries. This isn’t a case where there’s an elegant ignorance prior, you just have to actually use your knowledge.
It has been super interesting to read all your contributions to lukeprog’s post; this ‘paradox’ is no doubt interesting, because there seems to be a shared gut reaction that something is wrong with the above formulation. I stumbled across this page with the exact same dilemma as lukeprog while reading Peterson’s An Introduction to Decision Theory (2nd edition). As you have all pointed out, there seems to be something inherently fishy about his formulation of this particular example of ‘Rival Formalisations’.
I think that if you follow the logic of the initial axioms he uses in the book, this example does not follow. To give you some context, he formulates this ‘paradox’ by invoking two ‘axioms’: the Principle of Insufficient Reason (IR) and Merger of States (MS). These principles are as follows (Peterson 2009, page 35):
The Principle of Insufficient Reason (IR): If (Pi) is a formal decision problem in which the probabilities of the states are unknown, then it may be transformed into a formal decision problem (Pi)′ in which equal probabilities are assigned to all states.
Merger of states (MS): If two or more states yield identical outcomes under all acts, then these repetitious states should be collapsed into one, and if the probabilities of the two states are known, then they should be added.
From these two principles he first applies the IR rule and then the MS rule to formulate the ‘paradox’ above (Peterson 2009 page 35).
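To make the order-dependence concrete, here is a minimal sketch (my own toy encoding, not Peterson’s formalism) that applies IR and MS to the paparazzi matrix in both orders:

```python
# Toy illustration: IR and MS as operations on a decision matrix.
# Applying them in different orders yields different probability assignments.

def ir(states, probs, matrix):
    """Principle of Insufficient Reason: assign equal probability to every state."""
    n = len(states)
    return states, [1 / n] * n, matrix

def ms(states, probs, matrix):
    """Merger of States: collapse states with identical outcomes under all acts,
    adding their probabilities when probabilities are known."""
    merged_states, merged_probs, columns = [], [], []
    for i, s in enumerate(states):
        col = tuple(row[i] for row in matrix)
        if col in columns:
            j = columns.index(col)
            merged_states[j] = merged_states[j] + " or " + s
            if probs is not None:
                merged_probs[j] += probs[i]
        else:
            columns.append(col)
            merged_states.append(s)
            merged_probs.append(probs[i] if probs is not None else None)
    new_matrix = [[col[k] for col in columns] for k in range(len(matrix))]
    return merged_states, merged_probs if probs is not None else None, new_matrix

states = ["P", "LA", "NY"]
matrix = [[0, 10, 10],   # Stay
          [5, 6, 6]]     # Go to Paris

print(ms(*ir(states, None, matrix)))   # IR then MS: probabilities 1/3 and 2/3 (Table 2.15)
print(ir(*ms(states, None, matrix)))   # MS then IR: probabilities 1/2 and 1/2 (Table 2.17)
```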
As per the above post by lukeprog, Peterson, in his (1⁄3, 1⁄3, 1⁄3) (P, LA, NY) example, insists that the probabilities of LA and NY can be added together to make 2⁄3 and that this is a correct application of IR and MS.
The principle of IR would instead contradict this application, as he adds the 1⁄3 probabilities from NY and LA as if these probabilities were a priori known. Contrary to Peterson, they are not a priori known; that is why IR was invoked in the first place. When invoking MS, the probabilities of the states must be known in order for them to be added. From IR, we know that these probabilities have instead been arbitrarily assigned in equal proportions, precisely because the probabilities of the states in question (P, NY, LA) are a priori unknown. It should not follow from this that the probability of NY can be added to that of LA. To do so is to suggest that the probabilities are a priori known and unknown at the same time, which is a contradiction.
A good question one could ask is what the difference is between ‘collapsing’ states and adding probabilities, and whether it affects the above analysis. Much like with the ‘Sure-Thing Principle’, states with identical outcomes are collapsible into one because the probability component is irrelevant; regardless of what the likelihood of each state is, the outcomes are the same. I think that is why, in this example, collapsing NY/LA into (NY or LA) is permissible but adding probabilities without an a priori known origin is not. This suggests that LA and NY should first be collapsed into one state because of their identical outcomes, and only then, since the probabilities of states P and (LA or NY) are a priori unknown (and unknowable), should 1⁄2 and 1⁄2 be assigned to these states.
I believe this is where Peterson’s application of these two principles falls short and contradicts itself. With this in mind, the correct application would be to first use MS (premise 1 of MS) on (LA or NY), treat that as one “state”, and then assign 1⁄2, 1⁄2 probabilities to P and (LA or NY). To do otherwise would be to contradict IR and MS.
I think that is my way of explaining the seemingly ‘paradoxical’ outcome of Peterson’s example; careful reading suggests that his application is in no way compatible with the initial axioms.
Please refer to Peterson (2009), pages 33–35 of An Introduction to Decision Theory (Second Edition), for further reading.
Kind Regards,
Derek
As far as I can tell, this is just the standard complaint about the (naive?) Principle of Indifference and doesn’t have much to do with decision theory per se. E.g., here’s Keynes talking about a similar case. The most plausible solutions I know of are to either 1. insist that there simply are no rational constraints besides the axioms of probability on how we should weight the various possibilities in the absence of evidence and hence the problem is underdetermined (it depends on our “arbitrary” priors), or 2. accept that this is a real problem with Bayesian epistemology and hope something better comes along that doesn’t model all doxastic attitudes as probabilities.
Or, I suppose, 3. tell us how to actually calculate some priors. That would be fine too.
I don’t know of any plausible, objective, truly general methods of calculating priors. Solomonoff induction or whatever isn’t going to help very much.
Solomonoff induction (or similar) gives you your priors on your first day in the world.
You get a zillion updates after that that bear on the question of where Julia Roberts is most likely to be.
Could you tell me how to use Solomonoff induction to estimate the prior probability of Julia Roberts being in New York vs. LA vs. America vs. Paris?
Solomonoff induction lets you calculate priors for observing any finite stream of sense data. Pick a reference machine, enumerate the ways in which you might learn about her location and off you go.
O.K., let’s imagine I’ve enumerated all the ways E1, E2, E3, … in which I could learn about Julia Roberts’ location. What do I do now?
Read up about Solomonoff induction, by the sound of it. I gave you one link, and here is another one. You will need to use a computable approximation.
I’m familiar with Solomonoff induction. I don’t think it can be used to do what you want it to do (though open to be convinced otherwise), which is why I’m trying to ask you to spell out in detail how you think the highly formal mathematical machinery could be applied in principle to a real-world case like this one. In particular, I’m trying to ascertain how exactly you bridge the gap—in a general way—between the purely syntactic algorithmic complexity of a sequence of English letters and the relative probability of the statement that sequence semantically represents.
There is no “gap-bridging”. Solomonoff induction gives you the probability of a sequence of symbols. In practice that is typically applied to the sense data streams of agents (such as produced by camera or microphone), to give estimates of their probabilities. Solomonoff induction knows nothing of semantics—it just works on symbol sequences.
Yes, and that’s the source of the problem I was attempting to get at. Solomonoff induction works on sequences of symbols. Julia Roberts being in New York is not a sequence of symbols, although “Julia Roberts being in New York” is. The correct “epistemic” prior probability of the former is not simply synonymous with the “algorithmic” probability of generating the latter, at least not in the way that “bachelor” is synonymous with “unmarried male.” The question therefore is how the two are related, and it seems like the relationship you’re proposing is that they’re equal. But that’s a really bad rule, I think, because we don’t want the probability of Julia Roberts’ location to vary with our language or orthography. So we need something more sophisticated, which is what I’m asking you for.
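As a toy illustration of that worry (and only a toy: actual Solomonoff induction weights programs on a chosen reference machine, not English sentences), a naive simplicity prior of the form 2^(-length) over descriptions changes its ordering when the encoding changes. The "orthographies" below are made up for the example:

```python
# Toy illustration only: weight each description by 2**(-length) and normalize.
# The point is just that the resulting "prior" depends on how claims are encoded.

def simplicity_prior(descriptions):
    """Map each claim to a normalized 2**(-description length) weight."""
    weights = {claim: 2.0 ** -len(text) for claim, text in descriptions.items()}
    total = sum(weights.values())
    return {claim: w / total for claim, w in weights.items()}

# Two made-up encodings of the same pair of claims.
orthography_1 = {"Paris": "JR in Paris", "New York": "JR in New York City"}
orthography_2 = {"Paris": "JR in the capital of France", "New York": "JR in NYC"}

print(simplicity_prior(orthography_1))  # puts almost all the weight on "Paris"
print(simplicity_prior(orthography_2))  # puts almost all the weight on "New York"
```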
You started out with:

"I don’t know of any plausible, objective, truly general methods of calculating priors. Solomonoff induction or whatever isn’t going to help very much."
IMO, Solomonoff induction is pretty plausible, objective and general—though it does inevitably depend on a choice of language, and has the minor issue of being uncomputable. Your objections appear to be expecting too much of it. The point of my first reply was not so much to point at Solomonoff induction, but rather to emphasize all the subsequent updates pertaining to the issue—which in a case like this would swamp the prior.
I definitely think it’s plausible in some cases, particularly certain mathematical ones. However, I don’t see any reason whatsoever to imagine that our meagre updates swamp the prior for something like Julia Roberts’ location across most/all languages.
Well, the Solomonoff prior is pretty hopeless in this case. On the other hand, many know what language Julia Roberts speaks, where she is from and how big and celebrity-friendly these cities are. Experience gives us most information on this issue, I feel.
Is there some reason to suspect there isn’t some crazy, gerrymandered orthography such that those facts don’t swamp the priors? Or that, in general, for any two incompatible claims X and Y together with our evidence E, there aren’t two finitely specified orthographies which 1. differ in the relative algorithmic prior probabilities of the translations of X and Y into the orthographies and 2. have this difference survive conditionalizing on E? Because if so, we’re still stuck with a really nasty relativism if Solomonoff is the last word on priors.
There are certainly pathological reference machines—but that is only an issue if people use them.
Well, I already agreed that Solomonoff induction depends on a choice of language. There are not too many arguments over this, though—people can usually agree on some simple reference machine.
It seems like you’re saying that, pragmatically speaking, it’s not a problem if we all settle on the same set of formalisms. But I don’t see how that’s relevant to my point, which is that there are no real objective constraints on the formalism we use, and what’s more, any given formalism could lead to virtually any prior between 0 and 1 for any proposition. So, as I said earlier, Solomonoff doesn’t help very much in objectively guiding our priors. We could just dispense with this Solomonoff business entirely and say, “The problem of priors isn’t an issue if we all just arbitrarily choose the same priors!”
Sure there are. Use a sufficiently far-out reference machine and things go haywire, and you no longer get a useful implementation of Occam’s razor.
Not really: in many cases, if the proposition and language are selected, everyone agrees on the result.
Solomonoff induction is just a formalisation of Occam’s razor, which IMO, is very useful for selecting priors.
Key word there being “useful.” “Useful” doesn’t translate to “objectively correct.” Lots of totally arbitrarily set priors are useful, I’m sure, so if that’s your standard, then this whole discussion is again redundant. Anyway, the fact that Occam’s razor-as-we-intuit-it falls out of one arbitrary configuration of the parameters (reference machine, language and orthography) of the theory isn’t in itself evidence that the theory is amazingly useful, or even particularly true. It could just be evidence that the theory is particularly vulnerable to gerrymandering, and could theoretically be configured to support virtually anything. There is, I believe, a certain polynomial whose positive values are exactly the primes. But that turns out not to be so interesting, since every recursively enumerable set of integers corresponds to a similar such polynomial.