Random variables and Evidential Decision Theory
This post is inspired by the recent discussion I had with IlyaShpitser and Vaniver on EDT.
A random variable only ever has one value
In probability theory, statistics and so on, we often use the notion of a random variable (RV). If you go look at the definition, you will see that a RV is a function of the sample space. What that means is that a RV assigns a value to each possible outcome of a system. In reality, where there are no closed systems, this means that a RV assigns a value to each possible universe.
For example, a random variable X representing the outcome of a die roll is a function of type “Universe → {1..6}”. The value of X in a particular universe u is then X(u). Uncertainty in X corresponds to uncertainty about the universe we are in. Since X is a pure mathematical function, its value is fixed for each input. That means that in a fixed universe, say our universe, such a random variable only ever takes on one value.
So, before the die roll, the value of X is undefined [1], and after the roll X is forever fixed. X is the outcome of one particular roll. If I roll the same die again, that doesn’t change the value of X. If you want to talk about multiple rolls, you have to use different variables. The usual solution is to use indices: X_1, X_2, etc.
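As a rough illustration of the point above, here is a minimal Python sketch that treats random variables as plain functions of a "universe". The toy sample space (two rolls per universe), the uniform prior, and the names `universes`, `X1`, `X2` and `prob` are assumptions made for the example, not part of the original argument.

```python
from fractions import Fraction

# Hypothetical toy sample space: each "universe" records the outcome of
# two separate die rolls, (first roll, second roll).
universes = [(first, second) for first in range(1, 7) for second in range(1, 7)]

def X1(u):
    """Random variable: outcome of the first roll in universe u."""
    return u[0]

def X2(u):
    """Random variable: outcome of the second roll in universe u."""
    return u[1]

def prob(event):
    """Probability of an event (a predicate on universes), assuming a uniform prior."""
    favorable = sum(1 for u in universes if event(u))
    return Fraction(favorable, len(universes))

# In one fixed universe each variable has exactly one value.
our_universe = (3, 5)
print(X1(our_universe), X2(our_universe))  # 3 5

# Uncertainty about X1 is uncertainty about which universe we are in.
print(prob(lambda u: X1(u) == 4))  # 1/6
```

Rolling the die a second time does not change `X1(our_universe)`; the second roll is described by the separate function `X2`.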
This also means that the nodes in a causal model are not random variables. For example, in the causal model “Smoking → Cancer”, there is no single RV for smoking. Rather, the model is implicitly generalized to mean “Smoking_i → Cancer_i” for all persons i.
What this means for EDT
It is sometimes claimed that Evidential Decision Theory (EDT) cannot deal with causal structure. But I would disagree. To avoid confusion, I will refer to my interpretation as Estimated Evidential Decision Theory (EEDT).
Decision theories such as (E)EDT rely on the following formula to make decisions:
V(a) = Σ_j P(O=oj | a) · U(oj)
where oj are the possible outcomes, U(oj) is the utility of an outcome, O is a random variable that represents the actual outcome, and a is an action. The (E)EDT policy is to take the action that maximizes V(a), the value of that action.
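The value V(a) described above, the sum over outcomes of P(O=oj | a) times U(oj), can be sketched in a few lines of Python. The function names, the outcome labels and all the numbers below are hypothetical; the sketch deliberately takes the conditional probabilities as given and says nothing about how they are estimated.

```python
def action_value(p_outcome_given_action, utility):
    """V(a): sum over outcomes of P(O = o | a) * U(o)."""
    return sum(p * utility[o] for o, p in p_outcome_given_action.items())

def best_action(cond_probs, utility):
    """Return the action with the highest value V(a)."""
    return max(cond_probs, key=lambda a: action_value(cond_probs[a], utility))

# Hypothetical utilities and conditional probabilities P(O = o | a).
utility = {"good": 10.0, "bad": 0.0}
cond_probs = {
    "a1": {"good": 0.3, "bad": 0.7},
    "a2": {"good": 0.6, "bad": 0.4},
}

for a in cond_probs:
    print(a, action_value(cond_probs[a], utility))
print("chosen:", best_action(cond_probs, utility))
```

Everything that distinguishes one decision theory from another is hidden in how the table `cond_probs` gets filled in.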
How would you evaluate this formula in practice? To do that, you need to know P(O=oj | a). I.e. the probability of a certain outcome given that you take a certain action. But keep in mind the previous section! There is only one random variable O, which is the outcome of this action. Without assuming some prior knowledge, O is unrelated to the outcome of other similar actions in similar situations.
At the time an agent has to decide what action a to take, the action has not happened yet, and the outcome is not yet known to him. This means that the agent has no observations of O. The agent therefore has to estimate P(O=oj|a) by using only his prior knowledge. How this estimation is done exactly is not specified by EEDT. If the agent wants to use a causal model, he is perfectly free to do so!
You might argue that by not specifying how the conditional probabilities P(O=oj|a) are calculated, I have taken out the interesting part of the decision theory. With the right choice of estimation procedure, EEDT can describe CDT, normal/naive EDT, and even UDT [2]. But EEDT is not so general as to be completely useless. What it does give you is a way to reduce the problem of making decisions to that of estimating conditional probabilities.
Footnotes
1. Technically, ‘undefined’ is not in the domain of X. What I mean is that X is a partial function of universes, or a function only of universes in which the die has been rolled.
2. To get CDT, assume there is a causal model for A → O, and use that to estimate P(O=oj | do A=a). To get naive EDT, estimate the probabilities from data without taking causality or confounders into account. To get UDT, model A as being the choice of all sufficiently similar agents, not just yourself.
Look, just go read about causal models. You are confused about very basic things.
In my HAART example I gave you p(O=oj | a) explicitly. In that example using the EDT formula results in going to jail.
Not in my understanding. What you gave was P(O’=oj | a’), which looks similar, but talks about different RVs. That is the point I was trying to make by saying that “a random variable only ever has one value”.
Fair enough, I will do some reading when I have the time. Do you have any pointers to minimize the amount I have to read, or should I just read all of Pearl’s book?
ETA:
A (real valued) random variable X is a function with type “X : Ω → R”, where Ω is the sample space. There are two ways to treat causal models:
1. Each node X represents a random variable X. Different instances (e.g. patients) correspond to different samples Ω.
2. Each node X represents a sequence of random variables X_i. Different patients correspond to different indices. The sample space Ω contains the entire real world, or at least all possible patients as well as the agent itself.
Interpretation 1 is the standard one, I think. I was advocating the second view when I said that nodes are not random variables. I suppose I could have been more clear.
“Most of the causal inference community” agrees that causal models are made up of potential outcomes, which on the unit level are propositional logical variables that determine how some “unit” (person, etc.) responds in a particular way (Y) to a hypothetical intervention on the direct causes of Y. If we don’t know which unit we are talking about, we average over them to get a random variable Y(pa(Y)). This view is almost a century old now (Neyman, 1923), and is a classic view in statistics.
I think it’s fine if you want to advocate a “new view” on things. I am just worried that you might be suffering from a standard LW disease of trying to be novel without adequately understanding the state of play, and why the state of play is the way it is.
At the end of the day, CDT is a “model,” and “all models are wrong.” However, it gives the right answer to the HAART question, and moreover the only way to give the right answer to these kinds of questions is to be isomorphic to the “CDT algorithm” for these kinds of questions.
Is Y a particular way of responding (e.g. Y = “the person dies”), or is it a variable that denotes whether the person responds in that way (e.g. Y=1 if the person dies and 0 otherwise)? I think you meant the latter.
How does averaging over propositional logical variables give you a random variable? I am afraid I am getting confused by your terminology.
I wasn’t trying to be novel for the sake of it. Rather, I was just trying to write down my thoughts on the subject. As I said before, if you have some specific pointers to the state of the art in this field, then that would be much appreciated. Note that I have a background in computer science and machine learning, so I am somewhat familiar with causal models.
That sounds interesting. Do you have a link to a proof of this statement?
The latter.
There is uncertainty about which unit u we are talking about (given by some p(u) we do not see). So instead of a propositional variable assignment Y(pa(y), u) = y, we have an event with a probability p{Y(pa(y)) = y} = \sum_{u : Y(pa(y), u) = y} p(u).
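A small Python sketch of this averaging step, under made-up assumptions: four hypothetical unit types, each with a deterministic response to a binary treatment, and a made-up distribution p(u) over them. None of these names or numbers come from the discussion.

```python
# Hypothetical unit types: (p(u), deterministic response Y(treated, u)).
# Y = 1 means the unit recovers, Y = 0 means it does not.
units = {
    "always_recovers": (0.3, lambda treated: 1),
    "helped_by_treatment": (0.4, lambda treated: 1 if treated else 0),
    "hurt_by_treatment": (0.1, lambda treated: 0 if treated else 1),
    "never_recovers": (0.2, lambda treated: 0),
}

def p_potential_outcome(y, treated):
    """p{Y(treated) = y}: total probability of the units u with Y(treated, u) = y."""
    return sum(p for p, respond in units.values() if respond(treated) == y)

print(p_potential_outcome(1, treated=True))   # about 0.7
print(p_potential_outcome(1, treated=False))  # about 0.4
```

For a known unit, `respond(treated)` is just a fixed value; the randomness in Y(treated) comes entirely from not knowing which unit we are looking at.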
I am not sure I made a formal enough statement to prove. I guess:
(a) if you believe that your domain is acyclic causal, and
(b) you know what the causal structure is, and
(c) your utility is a function of the outcomes sitting in your causal system, and
(d) your actions on a variable embedded in your causal system break causal links operating from usual direct causes to the variable, and
(e) your domain isn’t “crazy” enough to demand adjustments along the lines of TDT,
then the right thing to do is to use CDT.
These preconditions hold in the HAART example. I am not sure exactly how to formalize (e) (I am not sure anyone does, this is a part of what is open).
Basically, yes. What separates EDT and CDT is whether they condition on the joint probability distribution or use the do operator on the causal graph; there’s no other difference. This is a productive difference, and so obscuring it is counterproductive.
What you call “EEDT” I would call “expected value calculation,” and then I would use “decision theory” to describe the different ways of estimating conditional probabilities (as you put it). It is right that expected value calculation is potentially nonobvious, and so saying “we use expected values” is meaningful and important, but I think that the language to convey the concepts you want to convey already exists, and you should use the language that already exists.
An update: your comment (among others) prompted me to do some more reading. In particular, the Stanford Encyclopedia of Philosophy article on Causal Decision Theory was very helpful in making the distinction between CDT and EDT clear to me.
I still think you can mess around with the notion of random variables, as described in this post, to get a better understanding of what you are actually predicting. But I suppose that this can be confusing to others.
I’m glad it helped!
I think there are three steps in the reduction:
1. A “decision problem”: “given my knowledge, what action should I take”, which is answered by a “decision theory/procedure”
2. Expected values: “given my knowledge, what is the expected value of taking action A”
3. Conditional probability estimation: “given my knowledge, what is my best guess for the probability P(X|Y)”
The reduction from 1 to 2 is fairly obvious, just take the action that maximizes expected value. I think this is common to all decision theories.
The reduction of 2 to 3 is what is done by EEDT. To me this step was not so obvious, but perhaps there is a better name for it.
So the difference is in how to solve step 3, I agree. I wasn’t trying to obscure anything, of course. Rather, I was trying to advocate that we should focus on problem 3 directly, instead of problem 1.
Do you have some more standard terms I should use?
This doesn’t hold in prospect theory, in which probabilities are scaled before they are used. (Prospect theory is a descriptive theory of buggy human decision-making, not a prescriptive decision theory.) [Edit] Actually, I think this is a disagreement in how you go from 2 to 3. In order to get a disagreement on how you go from 1 to 2, you’d have to look at the normal criticisms of VNM.
It’s also not universal in decision theories involving intelligent adversaries. Especially in zero-sum games, you see lots of minimax solutions (minimize my opponent’s maximum gain). This difference is somewhat superficial, because minimax does reduce to expected value calculation with the assumption that the opponent will always choose the best option available to him by our estimation of his utility function. If you relax that assumption, then you’re back in EV territory, but now you need to elicit probabilities on the actions of an intelligent adversary, which is a difficult problem.
There’s a subtle point here about whether the explicit mechanisms matter. Expected value calculation is flexible enough that any decision procedure can be expressed in terms of expected value calculation, but many implicit assumptions in EV calculation can be violated by those procedures. For example, an implicit assumption of EV calculation is consequentialism in the sense that the preferences are over outcomes, but one could imagine a decision procedure where the preferences are over actions instead of over outcomes, or more realistically action-outcome pairs. We can redefine the outcomes to include the actions, and thus rescue the procedure, but it seems worthwhile to have a separation between consequentialist and deontological positions. If you relax the implicit assumption that probabilities should be treated linearly, then prospect theory can be reformulated as an EV calculation, where outcomes are outcome-probability pairs. But again it seems worthwhile to distinguish between decision algorithms which treat probabilities linearly and nonlinearly.
I’m not sure how much we disagree at this point, if you see the difference between EDT and CDT as a disagreement about problem 3.
Why does it matter how the conditional probabilities are calculated? It’s not as if you could get a different answer by calculating it differently. No matter how you do the calculations, the probability of Box A containing a million dollars is higher if you one-box than if you two-box.
If you use a causal model, you figure that certain brain configurations will cause you to one-box, so one-boxing is evidence for having one of these brain configurations. These brain configurations will also cause omega to conclude that you’ll one-box and put the million dollars in Box A. Thus, having one of these brain configurations is evidence that Box A has a million dollars. Together, this means that one-boxing is evidence that Box A contains a million dollars.
You can get different answers. P(O|a) and P(O|do(a)) are calculated differently, and lead to different recommended actions in many models.
Other models are better at making this distinction, since the difference between EDT and CDT in Newcomb’s problem seems to boil down to the treatment of causality that flows backwards in time, rather than difference in calculation of probabilities. If you read the linked conversation, IlyaShpitser brings up a medical example that should make things clearer.
What’s the difference between a, and do(a)?
The English explanation is that P(O|a) is “the probability of outcome O given that we observe the action is a” and P(O|do(a)) is “the probability of outcome O given that we set the action to a.”
The first works by conditioning; basically, you go through the probability table, throw out all of the cases where the action isn’t a, and then renormalize.
The second works by severing causal links that point in to the modified node, while maintaining causal links pointing out of the modified node. Then you use this new severed subgraph to calculate a new joint probability distribution (for only the cases where the action is a).
The practical difference shows up mostly in cases where some environmental variable influences the action. If you condition on observing a, that means you make a Bayesian update, which means you can think your decision influences unmeasured variables which could have impacted your decision (because correlation is symmetric). For example, suppose you’re uncertain how serious your illness is, but you know that seriousness of illness is positively correlated with going to the hospital. Then, as part of your decision whether or not to go to the hospital, your model tells you that going to the hospital would make your illness be more serious because it would make your illness seem more serious.
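The hospital example can be made concrete with a two-node model, seriousness → hospital. The following Python sketch uses made-up numbers and hypothetical function names; it conditions by throwing out rows of the joint table and renormalizing, and it implements do() by replacing the hospital node's mechanism with a fixed value.

```python
# Hypothetical numbers for a model with one edge: serious -> hospital.
p_serious = 0.2
p_hospital_given_serious = {True: 0.9, False: 0.1}

def prior(serious):
    return p_serious if serious else 1 - p_serious

# Joint table over (serious, hospital) from the original graph.
table = {}
for serious in (True, False):
    for hospital in (True, False):
        p_h = p_hospital_given_serious[serious]
        table[(serious, hospital)] = prior(serious) * (p_h if hospital else 1 - p_h)

def p_serious_given_hospital(hospital):
    """Conditioning: keep only rows matching the observation, then renormalize."""
    kept = {k: v for k, v in table.items() if k[1] == hospital}
    return kept[(True, hospital)] / sum(kept.values())

def p_serious_do_hospital(forced):
    """Intervention: sever serious -> hospital, so hospital is simply set to `forced`."""
    severed = {
        (serious, hospital): prior(serious) * (1.0 if hospital == forced else 0.0)
        for serious in (True, False)
        for hospital in (True, False)
    }
    kept = {k: v for k, v in severed.items() if k[1] == forced}
    return kept[(True, forced)] / sum(kept.values())

print(p_serious_given_hospital(True))  # well above the prior of 0.2
print(p_serious_do_hospital(True))     # equal to the prior
```

With these numbers, observing a hospital visit raises the probability of a serious illness, while setting the visit leaves it at the prior.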
The defense of EDT is generally that of course the decision-maker would intuitively know which correlations are inside the correct reference class and which aren’t. This defense breaks down if you want to implement the decision-making as computer algorithms, where programming in intuition is an open problem, or you want to use complicated interventions in complicated graphs where intuition is not strong enough to reliably get the correct answer.
The benefit of do(a) is that it’s an algorithmic way of encoding asymmetric causality assumptions, such that lesion → smoke means we think learning about the lesion tells us about whether or not someone will smoke, and learning whether or not someone smoked tells us about whether or not they have the lesion, but changing someone from a smoker to a non-smoker (or the other way around) will not impact whether or not they have a lesion, while directly changing whether or not someone has the lesion will change how likely they are to smoke. We can algorithmically create the correct reference class for any given intervention into a causal network, which is the severed subgraph I mentioned earlier, with the do() operator.
How about a more concrete example: what’s the difference between observing that I one-box and setting that I one-box?
P(A|B) = P(A&B)/P(B). That is the definition of conditional probability. You appear to be doing something else.
p(a | do(b)) = p(a) if b is not an ancestor of a in a causal graph.
p(a | do(b)) = sum{pa(b)} p(a | b, pa(b)) p(pa(b)) if b is an ancestor of a in a causal DAG (pa(b) are the parents/direct causes of b in same). The idea is p(b | pa(b)) represents how b varies based on its direct causes pa(b). An intervention do(b) tells b to ignore its causes and become just a value we set. So we drop out p(b | pa(b)) from the factorization, and marginalize everything except b out. This is called “truncated factorization” or “g-formula.”
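To see the truncated factorization at work, here is a minimal Python sketch on a three-node DAG in which c is the only parent of b, and both b and c are parents of a. The graph, the binary variables and every number are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical DAG: c -> b, c -> a, b -> a, all variables binary.
p_c = {0: 0.5, 1: 0.5}
p_b1_given_c = {0: 0.2, 1: 0.8}                   # p(b=1 | c)
p_a1_given_bc = {(0, 0): 0.7, (1, 0): 0.9,        # p(a=1 | b, c), keyed by (b, c)
                 (0, 1): 0.2, (1, 1): 0.4}

def bern(p_one, value):
    """Probability that a binary variable with p(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

def joint(a, b, c):
    """Full factorization p(c) p(b | c) p(a | b, c)."""
    return p_c[c] * bern(p_b1_given_c[c], b) * bern(p_a1_given_bc[(b, c)], a)

def p_a_given_b(a, b):
    """Ordinary conditioning: p(a, b) / p(b)."""
    num = sum(joint(a, b, c) for c in (0, 1))
    den = sum(joint(a2, b, c) for a2, c in product((0, 1), repeat=2))
    return num / den

def p_a_do_b(a, b):
    """Truncated factorization: drop p(b | c), then marginalize out c."""
    return sum(p_c[c] * bern(p_a1_given_bc[(b, c)], a) for c in (0, 1))

for b in (0, 1):
    print(b, p_a_given_b(1, b), p_a_do_b(1, b))
```

With these numbers, conditioning suggests b=1 lowers the chance of a=1 (0.5 against 0.6), while the interventional quantity says it raises it (0.65 against 0.45), so the two formulas recommend opposite actions.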
If your causal DAG has hidden variables, there is sometimes no way to express p(a | do(b)) as a function of the observed marginal, and sometimes there is. You can read my thesis, or Judea’s book for details if you are curious. For example if your causal DAG is:
b → c → a with a hidden common cause h of b and a, then
p(a | do(b)) = sum{c} p(c | b) sum{b’} p(a | c, b’) p(b’)
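As a sanity check on this formula, the following Python sketch builds a full model with the hidden variable h (made-up binary distributions), computes the true p(a | do(b)) using h, and compares it with the expression above evaluated only from the observed marginal over (a, b, c). All names and numbers are hypothetical.

```python
from itertools import product

BIN = (0, 1)
# Hypothetical model: h -> b, h -> a, b -> c, c -> a, with h hidden.
p_h1 = 0.3
p_b1_given_h = {0: 0.2, 1: 0.7}
p_c1_given_b = {0: 0.1, 1: 0.8}
p_a1_given_ch = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # keyed by (c, h)

def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def full_joint(a, b, c, h):
    return (bern(p_h1, h) * bern(p_b1_given_h[h], b)
            * bern(p_c1_given_b[b], c) * bern(p_a1_given_ch[(c, h)], a))

def observed(a=None, b=None, c=None):
    """Observed marginal: sum out h and any of a, b, c left as None."""
    total = 0.0
    for a2, b2, c2, h in product(BIN, repeat=4):
        if (a is None or a2 == a) and (b is None or b2 == b) and (c is None or c2 == c):
            total += full_joint(a2, b2, c2, h)
    return total

def true_effect(a, b):
    """p(a | do(b)) computed with access to h: drop p(b | h) and marginalize."""
    return sum(bern(p_h1, h) * bern(p_c1_given_b[b], c) * bern(p_a1_given_ch[(c, h)], a)
               for c, h in product(BIN, repeat=2))

def front_door(a, b):
    """sum_c p(c | b) sum_b' p(a | c, b') p(b'), using observed quantities only."""
    total = 0.0
    for c in BIN:
        p_c_given_b = observed(b=b, c=c) / observed(b=b)
        inner = sum(observed(a=a, b=b2, c=c) / observed(b=b2, c=c) * observed(b=b2)
                    for b2 in BIN)
        total += p_c_given_b * inner
    return total

print(true_effect(1, 1), front_door(1, 1))
print(observed(a=1, b=1) / observed(b=1))  # plain p(a=1 | b=1), for comparison
```

The first two numbers agree, while plain conditioning gives a different value because of the hidden common cause.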
If you forget about causality, and view the g-formula rules above as a statistical calculus, you get something interesting, but that’s a separate story :).
What is pa(X)?
It doesn’t look to me like you’re doing EDT with a causal model. It looks to me like you’re redefining | so that CDT is expressed with the symbols normally used to express EDT.
I am doing CDT. I wouldn’t dream of doing EDT because EDT is busted :).
In the wikipedia article on CDT:
http://en.wikipedia.org/wiki/Causal_decision_theory
p(A > Oj) is referring to p(Oj | do(A)).
The notation p(a | do(b)) is due to Pearl, and it does redefine what the conditioning bar means, although the notation is not really ambiguous.(*) You can also do things like p(a | do(b), c) = p(a,c | do(b)) / p(c | do(b)). Lauritzen writes p(a | do(b)) as p(a || b). Robins writes p(a | do(b)) as p(a | g = b) (actually Robins was first, so it’s more fair to say Pearl writes the latter as the former). The potential outcome people write p(a | do(b)) as p(A_b = a) or p(A(b) = a).
The point is, do(.) and conditioning aren’t the same.
(*) The problem with the do(.) notation is you cannot express things like p(A(b) | B = b’), which is known in some circles as “the effect of treatment on the (un)treated,” and more general kinds of counterfactuals, but this is a discussion for another time. I prefer the potential outcome notation myself.
The OP implied that EDT becomes CDT if a certain model is used.
What do you mean by “busted”? It lets you get $1,000,000 in Newcomb’s problem, which is $999,000 more than CDT gets you.
Yes. I think the OP is “wrong.” Or rather, the OP makes the distinction between EDT and CDT meaningless.
I mean that it doesn’t work properly, much like a stopped clock.
Wasn’t the OP saying that there wasn’t a distinction between EDT and CDT?
If you want to get money when you encounter Newcomb’s problem, you get more if you use EDT than CDT. Doesn’t this imply that EDT works better?
Sure, in the same sense that a stopped clock pointing to 12 is better than a running clock that is five minutes fast, when it is midnight.
From past comments on the subject by this user it roughly translates to “CDT is rational. We evaluate decision theories based on whether they are rational. EDT does not produce the same results as CDT therefore EDT is busted.”
“Busted” = “does the wrong thing.”
If this is what you got from my comments on EDT and CDT, you really haven’t been paying attention.
Without a specified causal graph for Newcomb’s, this is difficult to describe. (The difference is way easier to explain in non-Newcomb’s situations, I think, like the Smoker’s Lesion, where everyone agrees on the causal graph and the joint probability table.)
Suppose we adopt the graph Prediction ← Algorithm → Box, where you choose your algorithm, which perfectly determines both Omega’s Prediction and which Boxes you take. Omega reads your algorithm, fills the box accordingly, but then before you can make your choice Professor X comes along and takes control of you, which Omega did not predict. Professor X can force you to one-box or two-box, but that won’t adjust Omega’s prediction of you (and thus which boxes are filled). Professor X might realistically expect that he could make you two-box and receive all the money, whereas you could not expect that, because you know that two-boxing means that Omega would predict that you two-boxed.
(Notice that this is different from the interpretation in which Omega can see the future, which has a causal graph like Box → Prediction, in which case you cannot surprise Omega.)
That’s what I’m describing, but apparently not clearly enough. P(A&B) was what I meant by the ‘probability of A once we throw out all cases where it isn’t B’, renormalized by dividing by P(B).
So, do(x) refers to someone else making the decision for you? Newcomb’s problem doesn’t traditionally have a “let Professor X mind-control you” option.
In your case, you cannot surprise Omega either. Only Professor X can.
Generally, no. Newcomb’s is weird, and so examples using it will be weird.
It may be clearer to imagine a scenario where there is a default value for some node, which may depend on other variables in the system, and that you could intervene to adjust it from the default to some other value you prefer.
For example, suppose you had a button that toggles whether a fire alarm is ringing. Suppose the fire alarm is not perfectly reliable, so that sometimes it rings when there isn’t a fire, and sometimes when there’s a fire it doesn’t ring. It’s very different for you to observe that the alarm is off, and then switch the alarm on, and for you to observe that the alarm is on.
If an EDT system only has two nodes, “fire” (which is unobserved) and “alarm” (which is observed), then it doesn’t have a way to distinguish between the alarm switching on its own (when we should update our estimate of fire) and the alarm switching because we pressed the button (when we shouldn’t update our estimate of fire). We could fix that by adding in a “button” node, or by switching to a causal network where fire points to alarm but alarm doesn’t point to fire. In general, the second approach is better because it lacks degrees of freedom which it should not have (and because many graph-based techniques scale in complexity based on the number of nodes, whereas making the edges directed generally reduces the complexity, I think). It’s also agnostic to how we intervene, which allows for us to use one graph to contemplate many interventions, rather than having a clear-cut delineation between decision and nature nodes.
Right; I meant to convey that in the Omega sees the future case, not even Professor X can surprise Omega.
Hopefully, you can tell the difference between an alarm you triggered and an alarm that you did not.
I can, and you can, but imagine that we’re trying to program a robot to make decisions in our place, and we can’t trust the robot to have our intuition.* Suppose we give it a utility function that prefers there not being a fire to there being a fire, but don’t give it control over its epistemology (so it can’t just alter its beliefs so it never believes in fires).
If we program it to choose actions which maximize P(O|a) in the two-node system, it’ll shut off the alarm in the hopes that it will make a fire less likely. If we program it to choose actions which maximize P(O|do(a)), it won’t make that mistake.
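A minimal Python sketch of the robot in the two-node system fire → alarm, with made-up reliability numbers and hypothetical function names. The robot's outcome of interest is "no fire", and it scores each action either by conditioning on the resulting alarm state or by the do() version.

```python
# Hypothetical numbers for the two-node model fire -> alarm.
p_fire = 0.01
p_alarm_on_given_fire = {True: 0.95, False: 0.05}

def p_fire_given_alarm(alarm_on):
    """Conditioning: Bayesian update on the observed alarm state."""
    def likelihood(fire):
        p_on = p_alarm_on_given_fire[fire]
        return p_on if alarm_on else 1 - p_on
    num = p_fire * likelihood(True)
    den = num + (1 - p_fire) * likelihood(False)
    return num / den

def p_fire_do_alarm(alarm_on):
    """Intervention: the link fire -> alarm is severed, so fire keeps its prior."""
    return p_fire

def choose(p_fire_if):
    """Pick the action with the highest probability of 'no fire'.

    Ties are resolved in favor of the first action listed.
    """
    actions = {"leave alarm on": True, "switch alarm off": False}
    return max(actions, key=lambda name: 1 - p_fire_if(actions[name]))

print(choose(p_fire_given_alarm))  # switch alarm off
print(choose(p_fire_do_alarm))     # leave alarm on
```

Under conditioning, a silent alarm looks like strong evidence against a fire, so the robot silences it. Under do(), both actions score the same, so the robot has no reason to touch the button.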
* People have built-in decision theories for simple problems, and so it often seems strange to demo decision theories on problems small enough that the answer is obvious. But a major point of mathematical decision theories is to enable algorithmic computation of the correct decision in very complicated systems. Medical diagnosis causal graphs can have hundreds, if not thousands, of nodes, and the impact on the network of adjusting some variables might be totally nonobvious. Maybe treating some symptoms has no effect on the progress of the disorder, whereas treating other symptoms does. There might also be symptoms whose treatment makes it slightly more likely that the disorder will be cured, but significantly less likely that we can tell whether the disorder is cured, and so calculating whether or not that tradeoff is worth it is potentially very complicated.
A robot would always be able to tell if it’s an alarm it triggered. Humans are the ones that are bad at it. Did you actually decide to smoke because EDT is broken, or are you just justifying it like that and you’re actually doing it because you have smoking lesions?
Once it knows its sensor readings, knowing whether or not it triggers the alarm is no further evidence for or against a fire.
I was not able to follow any of your discussion.
It would be great if you could make precise what you mean by certain terms. What is a ‘closed system’ in this context? What is the definition of ‘Universe’?
A typical system studied in introductory probability theory might be a die roll. This is a system with a sample space with 6 states. It is an abstract thing, that doesn’t interact with anything else. In reality, when you roll a die, that die is part of the real world. The same world that also contains you, the earth, etc. That is what I meant by ‘universe’.
For closed systems, I was thinking of the term as used in physics:
Is this original research or have these ideas actually been fleshed out formally somewhere I can read about them?
I don’t know. That is to say: it is original research, but probably subconsciously inspired by and stolen from many other sources.
In “Universe → {1..6}”, Universe is the type of the sample space of a random variable.
in “The value of X in a particular universe u is then X(u)”, universe refers to a specific sample in the sample space.
I don’t think this is exactly what twanvl means, especially if you consider that he’s saying things like:
For a die roll, the ‘Universe’ is simply {1,2,3,4,5,6}. At least, this is the normal way it is done in probability theory.
And it still doesn’t answer what he means by ‘closed system’.
No, I’m reasonably confident that in the example the author meant ‘Universe’ to mean {all possible states of the universe after the die roll}. The random variable X is a function mapping from states of the universe to values at the top of the die. The inverse image of {4} is the set of all states of the universe where the die landed on 4. That inverse image defines an [event](http://en.wikipedia.org/wiki/Event_(probability_theory)). The measure of that event is what we mean when we say ‘probability of the die landing on 4’.
I’m not entirely sure either. Rather than take a guess, I’ll let twanvl speak for him/herself.