An introduction to decision theory
This is part 1 of a sequence to be titled “Introduction to decision theory”.
Less Wrong collects together fascinating insights into a wide range of fields. If you understood everything in all of the blog posts, then I suspect you’d be in quite a small minority. However, a lot of readers probably do understand a lot of it. Then, there are the rest of us: The people who would love to be able to understand it but fall short. From my personal experience, I suspect that there are an especially large number of people who fall into that category when it comes to the topic of decision theory.
Decision theory underlies much of the discussion on Less Wrong and, despite buckets of helpful posts, I still spend a lot of my time scratching my head when I read, for example, Gary Drescher’s comments on Timeless Decision Theory. At its core this is probably because, despite reading a lot of decision theory posts, I’m not even 100% sure what causal decision theory or evidential decision theory is. Which is to say, I don’t understand the basics. I think that Less Wrong could do with a sequence that introduces the relevant decision theory from the ground up and ends with an explanation of Timeless Decision Theory (and Updateless Decision Theory). I’m going to try to write that sequence.
What is a decision theory?
In the interests of starting right from the start, I want to talk about what a decision theory is. A decision theory is a formalised system for analysing possible decisions and picking from amongst them. Normative decision theory, which this sequence will focus on, is about how we should make decisions. Descriptive decision theory is about how we do make decisions.
Decision theory involves looking at the possible outcomes of a decision. Each outcome is given a utility value, expressing how desirable that outcome is. Each outcome is also assigned a probability. The expected utility of taking an action is equal to the sum of the utilities of each possible outcome, each multiplied by the probability of that outcome occurring. To put it another way, you add together the utilities of the possible outcomes, weighted by their probabilities, so that the less likely an outcome is, the less its value counts towards the total.
Before this gets too complicated, let’s look at an example:
Let’s say you are deciding whether to cheat on a test. If you cheat, the possible outcomes are getting full marks on the test (50% chance, 100 points of utility, one for each percentage point correct) or getting caught cheating and getting no marks (50% chance, 0 utility).
We can now calculate the expected utility of cheating on the test:
(1/2 * 100) + (1/2 * 0) = 50 + 0 = 50
That is, we look at each outcome, determine how much it should contribute to the total utility by multiplying the utility by its probability and then add together the value we get for each possible outcome.
So, decision theory would say (questions of morality aside) that you should cheat on the test if you would get less than 50% on the test if you didn’t cheat.
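The calculation above can be sketched in a few lines of Python. This is only an illustration: the function names `expected_utility` and `should_cheat` are made up for this sketch, and the rule "cheat only if it beats your honest score" simply restates the conclusion above, with questions of morality still set aside.

```python
def expected_utility(outcomes):
    """Sum of probability * utility over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)


# Cheating: 50% chance of full marks (100 utility), 50% chance of being caught (0 utility).
cheat_outcomes = [(0.5, 100), (0.5, 0)]
eu_cheat = expected_utility(cheat_outcomes)


def should_cheat(honest_score):
    """Compare cheating against the (assumed certain) score from not cheating."""
    return eu_cheat > honest_score


print(eu_cheat)          # expected utility of cheating
print(should_cheat(40))  # honest score below 50
print(should_cheat(60))  # honest score above 50
```

Treating the honest score as certain is a simplification; if it were uncertain too, it would get its own list of (probability, utility) pairs and go through the same function.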
Those who are familiar with game theory may feel that all of this is very familiar. That’s a reasonable conclusion: a good approximation of decision theory is that it’s one-player game theory.
What are causal and evidential decision theories?
Two of the principal decision theories popular in academia at the moment are causal and evidential decision theories.
In the description above, when we looked at each outcome we considered two factors: the probability of it occurring and the utility gained or lost if it did occur. Causal and evidential decision theories differ by defining the probability of the outcome occurring in two different ways.
Causal Decision Theory defines this probability causally. That is to say, it asks: what is the probability that, if action A is taken, outcome B will occur? Evidential Decision Theory asks what evidence the action provides for the outcome. That is to say, it asks: what is the probability of B occurring given the evidence of A? These may not sound very different so let’s look at an example.
Imagine that politicians are either likeable or unlikeable (and they are simply born this way—they cannot change it) and the outcome of the election they’re involved in depends purely on whether they are likeable. Now let’s say that likeable people have a higher probability of kissing babies and unlikeable people have a lower probability of doing so. But this politician has just changed into new clothing and the baby they’re being expected to kiss looks like it might be sick. They really don’t want to kiss the baby. Kissing the baby doesn’t itself influence the election, that’s decided purely based on whether the politician is likeable or not. The politician does not know if they are likeable.
Should they kiss the baby?
Causal Decision Theory would say that they should not kiss the baby because the action has no causal effect. It would calculate the probabilities as follows:
If I am likeable, I will win the election. If I am not, I will not. I am 50% likely to be likeable.
If I don’t kiss the baby, I will be 50% likely to win the election.
If I kiss the baby, I will be 50% likely to win the election.
I don’t want to kiss the baby so I won’t.
Evidential Decision Theory on the other hand, would say that you should kiss the baby because doing so is evidence that you are likeable. It would reason as follows:
If I am likeable, I will win the election. If I am not, I will not. I am 50% likely to be likeable.
If I kissed the baby, there would be an 80% probability that I was likeable (to choose an arbitrary percentage).
If I did not kiss the baby, there would be a 20% probability that I was likeable.
Therefore:
Given the action of me kissing the baby, it is 80% probable that I am likeable and thus the probability of me winning the election is 80%.
Given the action of me not kissing the baby, it is 20% probable that I am likeable and thus the probability of me winning the election is 20%.
So I should kiss the baby (presuming the desire to avoid kissing the baby is only a minor desire).
This spells the reasoning out explicitly, but the basic point is this: Evidential Decision Theory asks whether an action provides evidence about the probability of an outcome occurring, whereas Causal Decision Theory asks whether the action will causally affect the probability of an outcome occurring.
The question of whether either of these decision theories works under all circumstances that we’d want them to is the topic that will be explored in the next few posts of this sequence.
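To make the two lines of reasoning concrete, here is a minimal Python sketch of the baby-kissing decision. The 50%, 80% and 20% figures come from the example above; the utility numbers (`U_WIN`, `U_KISS_COST`) are assumptions invented for the sketch, chosen only so that avoiding the kiss is a minor desire compared with winning.

```python
P_LIKEABLE = 0.5                # prior from the example
P_LIKEABLE_GIVEN_KISS = 0.8     # the example's arbitrary figure
P_LIKEABLE_GIVEN_NO_KISS = 0.2  # likewise

U_WIN = 100      # assumed utility of winning the election
U_KISS_COST = 5  # assumed (minor) cost of kissing the baby


def cdt_utility(kiss):
    # Causal view: kissing cannot change likeability,
    # so the chance of winning stays at the prior either way.
    p_win = P_LIKEABLE
    return p_win * U_WIN - (U_KISS_COST if kiss else 0)


def edt_utility(kiss):
    # Evidential view: condition the chance of being likeable on the action.
    p_win = P_LIKEABLE_GIVEN_KISS if kiss else P_LIKEABLE_GIVEN_NO_KISS
    return p_win * U_WIN - (U_KISS_COST if kiss else 0)


cdt_choice = max([True, False], key=cdt_utility)
edt_choice = max([True, False], key=edt_utility)

print("CDT kisses the baby:", cdt_choice)
print("EDT kisses the baby:", edt_choice)
```

With these assumed utilities the causal calculation prefers not kissing and the evidential one prefers kissing, matching the two conclusions above. A large enough kissing cost would flip the evidential answer, which is why the example stipulates that the desire is only minor.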
Appendix 1: Some maths
I think that when discussing a mathematical topic, there’s always something to be gained from having a basic knowledge of the actual mathematical equations underpinning it. If you’re not comfortable with maths though, feel free to skip the following section. Each post I do will, if relevant, end with a section on the maths behind it but these will always be separate to the main body of the post – you will not need to know the equations to understand the rest of the post. If you’re interested in the equations though, read on:
Decision theory assigns each action a utility: the sum, over the possible outcomes, of the probability of each outcome multiplied by the utility of that outcome. It then applies this equation to each possible action to determine which one leads to the highest utility. As an equation, this can be represented as:
$$U(A) = \sum_i P_i D_i$$
Where U(A) is the utility gained from action A. Capital sigma, the Greek letter, represents the sum over all i, P_i represents the probability of outcome i occurring and D_i, standing for desirability, represents the utility gained if that outcome occurred. Look back at the cheating on the test example to get an idea of how this works in practice if you’re confused.
Now causal and evidential decision theory differ based on how they calculate P_i. Causal Decision Theory uses the following equation:
$$U(A) = \sum_i P(A \rightarrow O_i)\, D_i$$
In this equation, everything is the same as in the first equation except that the probability term is calculated as the probability of outcome O_i occurring if action A is taken.
Similarly, Evidential Decision Theory uses the following equation:
$$U(A) = \sum_i P(O_i \mid A)\, D_i$$
Where the probability is calculated as the probability of O_i given that A is true.
If you can’t see the distinction between these two equations, then think back to the politician example.
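For readers who prefer code to equations, the sketch below computes both kinds of probability for the politician from a single assumed model. The model itself is an assumption made for illustration: likeability is fixed first with a 50% prior, likeable politicians kiss babies 80% of the time and unlikeable ones 20% of the time (figures picked so that the conditional probabilities match the 80% and 20% used in the post). The causal probability is computed by treating the action as set from outside the model, which is one common way of formalising it rather than a settled definition.

```python
P_LIKEABLE = 0.5
# Hypothetical P(kiss | likeable?) values, chosen to reproduce the post's numbers.
P_KISS_GIVEN = {True: 0.8, False: 0.2}


def joint(likeable, kiss):
    """P(likeable, kiss) under the assumed model."""
    p_l = P_LIKEABLE if likeable else 1 - P_LIKEABLE
    p_k = P_KISS_GIVEN[likeable] if kiss else 1 - P_KISS_GIVEN[likeable]
    return p_l * p_k


def p_win_evidential(kiss):
    # P(O | A): condition on the action having been observed.
    p_action = sum(joint(likeable, kiss) for likeable in (True, False))
    return joint(True, kiss) / p_action


def p_win_causal(kiss):
    # The action is imposed from outside, so it carries no information
    # about likeability; the chance of winning stays at the prior.
    return P_LIKEABLE


for kiss in (True, False):
    print(kiss, round(p_win_evidential(kiss), 3), p_win_causal(kiss))
```

The only difference between the two functions is whether the action is allowed to tell us anything about likeability, which is the distinction the two equations above are meant to capture.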
Appendix 2: Important Notes
The question of how causality should be formalised is still an open one; see cousin_it’s comments below. As an introductory level post, we will not delve into these questions here but it is worth noting there is some debate on how exactly to interpret causal decision theory.
It’s also worth noting that the baby kissing example mentioned above is more commonly discussed on the site as the Smoking Lesion problem. In the smoking lesion world, people who smoke are much more likely to get cancer. But smoking doesn’t actually cause cancer; rather, there’s a genetic lesion that can cause both cancer and people to smoke. If you like to smoke (but really don’t like cancer), should you smoke? Once again, Causal Decision Theory says yes. Evidential Decision Theory says no.
The next post is “Newcomb’s Problem: A problem for Causal Decision Theories”.
The act of adopting EDT increases the probability that, given that you kiss the baby, you are unlikable (because the only unlikable politicians that would baby-kiss are those that adopt EDT.) So EDT does nothing for you except increase the probability of getting baby barf all over your suit.
One could counterargue and say: “the listed probabilities facing the EDT politician already take into account that she’s EDT!” But the numbers can’t be coherently interpreted that way if adopting EDT has a causal effect on your choices, which one would hope it does; it can’t be the case that 80% of EDT kissers and 20% of EDT nonkissers are likable, and that 50% of EDT politicians are likable unless only half the EDTs go ahead with the kiss.
A simpler proof: suppose 400 politicians, 100 in each quadrant. (You have to shift around the expected probabilities to make this work, and I’m too lazy to do that, but I don’t believe you have to do so in any way that’s damaging to the basic decision structure.) Should the 200 EDT politicians be more likable than the 200 CDT politicians? Ex hypothesi, no! If there’s an argument for EDT it would have to show that the likable politicians are more likely to adopt EDT. (And maybe this is the case because of motivated cognition: the unlikable politician searches for a justification to not kiss, and adopts CDT; the likable one searches for a justification to kiss, and adopts EDT. But my intuition is that this doesn’t demonstrate this in the relevant sense, since the superiority of EDT should work with rational agents without motivated cognition or with reversed motivated cognition.)
Re: “EDT does nothing for you except increase the probability of getting baby barf all over your suit.”
If the description is right—but is it? Why isn’t the evidence relating to the supplied fact—that kissing the baby doesn’t itself influence the election—being given much weight?
Because it’s a simplifying assumption of the model. In the real world politicians do kiss babies because there is some uncertainty (even if elections are almost always a function of unemployment.)
I’d like to hear more of your thoughts. How much greater is the influence of unemployment than, say, the ability of the incumbent powers to create a real or perceived external threat?
What I should have said is that the vast majority of the variation in election outcomes can be explained by incumbency, the absolute levels of unemployment, and recent trends in unemployment. (If you’re comparing across jurisdictions, throw in demographics as a simple shock in one party’s direction.) This is a weaker claim than the one I irresponsibly made: “incumbency” includes effects that work both ways (for instance, Presidential incumbency is a barrier to his party in mid-term elections, while some incumbency advantage may be tracking the sort of threats you mention). Additionally, campaign tactics can end up mattering in the peripheral case of one candidate simply not bothering; witness the case of Martha Coakley. But since most campaigns are run with at least the minimal level of competence necessary to avoid this, campaign competence is generally irrelevant. And, finally, things that genuinely matter are captured by the unemployment effect, for instance fundraising and volunteer hours. (In the case of fundraising, the effect is double: people hungry for change or happy with the status quo send money to effect that, and smart corporations invest in the likely winner.)
There are definite effects for “real or perceived threats” (it’s called the “rally-round-the-flag effect”), but it comes up in such a small portion of elections that you can generally ignore it until it, well, comes up. But when it matters at all it matters.
This obviously applies less to primaries of non-incumbent parties, since the unemployment factor is controlled. This is where donors, strategy, and candidate-specific factors play the most important role. And I’m not as familiar with non-US contexts.
The truth behind all these caveats is that almost all campaign coverage you see in the US (minute analysis of the events of the last news cycle) is for entertainment value only. Joe Biden having a gaffe or Obama giving a moving speech on race relations is not going to affect anything in the least. They cover things in this silly way because the alternative would be bad for ratings.
That doesn’t seem like a good reason for ignoring relevant evidence to the point where you do a stupid thing. Does using evidential decision theory really lead to being barfed on by babies this easily?
I hope your posts (after dealing with the theory/theories) will carry on to the situations in which they may and may not be useful. Things like: how are the outcomes found, how are the utilities estimated and how are the probabilities estimated? What decisions are easier or harder to model with these theories? It would be nice to have some examples ranging from something easy like whether now is the time to buy a car to something difficult like when is the right time to put your mother in a care home. I don’t mean that the decision needs to be hard but it’s the modeling difficulty that interests me. To me the question that immediately follows ‘is the theory logical?’ is ‘is the theory usable?’
I am voting up and looking forward to the next bit.
Thanks for the positive comments. I’ll certainly try to tackle those issues in the sequences. As you note, the first question I’m going to look at is, “Is the theory logical?” but after that I’ll certainly tackle the question of usability.
I applaud anyone who figures out stuff for themselves and posts it for the benefit of others, but this post is extremely unclear. How do you define these funky “causal probabilities” to someone who only knows regular conditional probabilities? And how can kissing the baby be evidence for anything, if it’s determined entirely by which decision theory you adopt? In short, I feel your “explanation” doesn’t look inside black boxes, only shuffles them around. I’d prefer a more formal treatment to ensure that no lions lurk in the shadows.
Are you referring to the fact that evidential decision theories rely on conditional probability, working as follows:
Modelling probability as ways the world could be (ie. if the world can be two ways and A is true in one of them then it’s 50% probable).
Imagine the world has ten ways to be: A is true in five of them, B is true in six, but A and B are both true in only two. So our probability of A is 5⁄10 = 1⁄2. Our probability of A given B is 2⁄6 = 1⁄3, because B being true reduces the number of ways the world can be to six, and in only two of these is A true.
So, the answer to the question about how kissing the baby is evidence is as above: It’s evidence because it rules out some possible ways that the world could be. For example, there may originally have been ten worlds, five where the politician was likeable and five where they weren’t. The politician kisses the baby in only one where he’s unlikeable and in four where he’s likeable. Kissing the baby then reduces it down to five ways the world could be and in four of these he’s likeable so by kissing the baby, the probability is higher that the world is such that he is likeable.
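The world-counting picture can be checked mechanically. The sketch below lists the ten equally likely worlds from the A and B example and counts them; the `prob` helper and the particular arrangement of worlds are illustrative choices, and any arrangement with A true in five worlds, B in six and both in two gives the same answers.

```python
from fractions import Fraction

# Ten equally likely worlds as (A, B) truth values:
# A true in five, B true in six, both true in two.
worlds = (
    [(True, True)] * 2
    + [(True, False)] * 3
    + [(False, True)] * 4
    + [(False, False)] * 1
)


def prob(event, given=lambda w: True):
    """Fraction of the worlds satisfying `given` in which `event` holds."""
    remaining = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in remaining if event(w)), len(remaining))


p_a = prob(lambda w: w[0])
p_a_given_b = prob(lambda w: w[0], given=lambda w: w[1])

print(p_a)          # probability of A
print(p_a_given_b)  # probability of A given B
```

Conditioning is just the filtering step: learning that B is true throws away the worlds where it is false, and the probability of A is recounted among those that remain.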
I don’t expect that to be a breakthrough to you. I’m just asking whether that’s the sort of thing you were thinking I should have said (but better crafted).
As to the difference between conditional and causal probability, causal probabilities would be a subset of the conditional ones where “A causes B”. What it means to say “A causes B” seems beyond the scope of an introductory article to me though. Or am I missing what you mean? Is there a simple way to explain what a causal probability is at an introductory level?
I think I’ve made it obvious in my postings here that I consider I have a lot to learn. If you think you can see a way I should have done it, I’d be really interested to know what it was and I could try and edit the post or write another post to explain it.
IMO, right now decision theory is not a settled topic to write tutorials like this about. You might say that I was dissatisfied with the tone of your post: it implied that somewhere there are wise mathematicians who know what “causal probabilities” mean, etc. In truth there are no such wise mathematicians. (Well, you could mention Judea Pearl as a first approximation, but AFAIK his work doesn’t settle the issues completely, and he uses a very different formalism.) Any honest introduction should clearly demarcate the dubious parts with “here be dragons”. When I started learning decision theory, introductions like yours wasted a huge amount of my time and frustrated me to no end, because they always seemed to assume things that just weren’t there in the specialized literature.
You sound like someone who is in a position to write a great intro to DT. Would you consider doing that, or perhaps collaborating with this post’s author?
That would feel a bit like hijacking. The obvious candidates for writing such a post are Eliezer, Gary Drescher, or Wei Dai. I don’t know why they aren’t doing that, probably they feel that surveying the existing literature is quite enough to make yourself confused. I’ll think about your suggestion, though. If I find an honest and accessible angle, I’ll write a post.
I’m not writing a tutorial on decision theory because I think a simple informal understanding of expected utility maximization is sufficient for almost all practical decision making, and people who are interested in the technical details, or want to work on things like anthropic reasoning or Newcomb’s problem can easily find existing material on EDT and CDT. (I personally used the book that I linked to earlier. And I think it is useful to survey the existing literature, if only to confirm that a problem exists and hasn’t already been solved.)
But anyway, Adam Bell seems to be doing a reasonable job of explaining EDT and CDT to a non-technical audience. If he is successful, it might give people a better idea of what it is that we’re actually trying to accomplish with TDT and UDT.
Looks like this is being addressed:
http://lesswrong.com/lw/2lg/desirable_dispositions_and_rational_actions/2gec?c=1
Also note an addition to the post: Appendix 2. I don’t feel like going into these details here would benefit all beginners (it may benefit some but disadvantage others) but you’re right that I can at least signpost that there is an issue and people who want more details can get a bit of a start from reading these comments.
Fair enough. I understand what you’re saying and it’s a shame that this sort of introduction caused problems when you were learning decision theory. However, I feel like this is just the sort of thing that I did need to help me learn decision theory. Sometimes you need to have a flawed but simple understanding before you can appreciate the flaws in your understanding. I’m the sort of person that would probably never get there if I was expected to see the issues in the naive presentation straight away.
Maybe this introduction won’t be suitable for everyone but I don’t feel like these issues mean it will be useful to no-one. However, I can see what you mean about at least signposting that there are unresolved issues. My current plan involves introducing various issues in decision theory so people at least understand what the discussion is about and then to do a state of play post which surveys issues in the field, unresolved issues and outlines why decision theory is still a wide open field.
That may go some way to resolving your concerns or you may just feel like such an approach is pointless. However, I do hope that this post will benefit some people and some types of learners even if it doesn’t benefit you personally or people who learn in the same way as you.
This depends on what you mean by “learn” and what objective you want to achieve by learning. I don’t believe in having a “flawed but simple understanding” of a math topic: people who say such things usually mean that they can recite some rehearsed explanations, but cannot solve even simple problems on the topic. Solving problems should come first, and intuitive explanations should come later.
Imagine you live in the middle ages and decide to study alchemy. So you start digging in, and after your first few lessons you happily decide to write an “intuitive introduction to alchemy techniques” so your successors can pass the initial phase more easily. I claim that this indicates a flawed mindset. If you cannot notice (“cannot be expected to see”, as you charmingly put it) that the whole subject doesn’t frigging work, isn’t your effort misplaced? How on Earth can you be satisfied with an “intuitive” understanding of something that you don’t even know works?
I apologize if my comments here sound rude or offensive. I’m honestly trying to attack what I see as a flawed approach you have adopted, not you personally. And I honestly think that the proper attitude to decision theory is to treat it like alchemy: a pre-paradigmatic field where you can hope to salvage some useful insights from your predecessors, but most existing work is almost certainly going to get scrapped.
No need to worry about being rude or offensive—I’m happy to talk about issues rather than people and I never thought we were doing anything different. However, I wonder if a better comparison is with someone studying “Ways of discovering the secrets of the universe.” If they studied alchemy and then looked at ways it failed that might be a useful way of seeing what a better theory of “discovering secrets” will need to avoid.
That’s my intention. Study CDT and see where it falls down so then we have a better sense of what a Decision Theory needs to do before exploring other approaches to decision theory. You might do the same with alchemy and you might explain its flaw. But first you have to explain what alchemy is before you can point out the issues with it. That’s what this post is doing—explaining what causal decision theory is seen to be before we look at the problems with this perception.
To look at alchemy’s flaws, you first need to know what alchemy is. Even if you can see it’s flawed from the start, that doesn’t mean a step by step process can’t be useful.
Or that’s how I feel. Further disagreement is welcome.
Sorry for deleting my comment—on reflection it sounded too harsh.
Maybe it’s just me, but I don’t think you’re promoting the greater good when you write an intuitive tutorial on a confused topic without screaming in confusion yourself. What’s the hurry, anyway? Why not make some little bits perfectly clear for yourself, and write then?
Here’s an example of an intuitive explanation (of an active research topic, no less) written by someone whose thinking is crystal clear: Cosma Shalizi on causal models. One document like that is worth a thousand “monad tutorials” written by Haskell newbies.
Maybe there should be a top-level post on how causal decision theory is like burritos?
I can’t believe you just wrote that. The whole burrito thing is just going to confuse people, when it’s really a very straightforward topic.
Just think of decision theory as if it were cricket...
At least that would be a change from treating decision theory as if it were all about prison.
I don’t think you’ve sounded harsh. You obviously disagree with me but I think you’ve done so politely.
I guess my feeling is that different people learn differently and I’m not as convinced as you seem to be that this is the wrong way for all people to learn (as opposed to the wrong way for some people to learn). I grant that I could be wrong on this but I feel that I, at the very least, would gain something from this sort of tutorial. Open to be proven wrong if there’s a chorus of dissenters.
Obviously, I could write a better explanation of decision theory if I had researched the area for years and had a better grasp of it. However, that’s not the case, so I’m left to decide what I should do given the experience I do have.
I am writing this hoping that doing so will benefit some people.
And doing so doesn’t stop me writing a better tutorial when I do understand the topic better. I can still do that when that time occurs and yet create something that hopefully has positive value for now.
Thx for the Shalizi link. I’m currently slogging my way through Pearl, and Shalizi clarifies things.
At first I thought that AdamBell had invented Evidential Decision Theory from whole cloth, but I discover by Googling that it really exists. Presumably it makes sense for different problems—it certainly did not for the baby-kissing story as presented.
As far as I know, there’s still no non-trivial formalization of the baby-kissing problem (aka Smoking Lesion). I’d be happy to be proved wrong on that.
That’s not Adam Bell’s fault. Those black boxes are inherent in CDT. You can read a CDT proponent’s formal treatment of causal probabilities here, and see for yourself.
The main differences that I see between EDT and CDT are, first, that EDT is misnamed. The equations do not use “evidence” in the common sense meaning of that term. They use it in the way people deploying statistics tend, erroneously, to use it—as presuming causes when the only valid interpretation of the numbers treats them as correlations. The probability of B, given A, P(B|A), can be due to A causing B (which is how we are to understand the arrow in Bell’s equations in Appendix 1); B causing A; C causing both; A and B happening in periodic cycles that tend to align (e.g. monthly menstruation and full moons); A and B being in part-whole relationships; in container-contained relationships; and many many other correlated relationships; or due to various combinations thereof. Given all of such possible relationships, it then becomes a mistake to believe that doing A will result in (e.g. cause) B, just because the probability of event B occurring, given that A has occurred, is high. The problem is well-illustrated by the fallaciousness of the reasoning in EDT that recommends that politicians kiss babies, given the set up supposed in Bell’s example.
The second major difference, brought out by the first, is that standard statistical treatments of probabilities are based on frequencies, rather than propensities, as interpretations of the numbers. As such, the assumptions made in such theories work well in depicting areas of investigation that are like tosses of dice, drawing cards, pulling balls from urns, and so on, but tend to go awry when trying to depict causation, or isolate signals from noise in real cases, as when conceptually ill-equipped investigators think that, because they have found a reliable correlation, they have found a causal influence rather than a correlation due to noise. For example, in order to establish a good p-value, they compare their results to what would happen under a random distribution, and treat the resulting p-value as a good basis on which to claim they have established a prima facie case (a “finding”) that A “influences” B. The problem is that showing that a correlation isn’t due to random distribution doesn’t show much. Given all the other options mentioned above, all of which are potentially part of the background noise, what investigators need for their stronger claims is a comparison with a natural distribution, which tends not to be random, and contains all the potential confounders that explain the correlation differently.
This should, I hope, bring out some of what is at issue here. Decision theory, ideally, gives precise symbolic form and apt guidance for decision making dilemmas. But to do so, it needs its numeric and symbolic representations to be good maps of the terrain, which they aren’t, yet, if “EDT” is being used. In short, if you can ignore causation, and causation’s arrow, EDT will suit you fine. The problem is that most of the time you can’t.
Causal Decision Theory works fine for baby-kissing. It’s just slightly harder to apply than EDT.
In CDT you must consider the causality from likeability to election victory.
One of the strands of that causality is: Likeability -> Appearance of likeability (via baby-kissing etc.) -> election victory
Baby kissing has a direct impact on your election chances; there is a clear causal chain.