Why does it matter how the conditional probabilities are calculated? It’s not as if you could get a different answer by calculating it differently. No matter how you do the calculations, the probability of Box A containing a million dollars is higher if you one-box than if you two-box.
You can get different answers. P(O|a) and P(O|do(a)) are calculated differently, and lead to different recommended actions in many models.
Other models are better at making this distinction, since the difference between EDT and CDT in Newcomb’s problem seems to boil down to the treatment of causality that flows backwards in time, rather than a difference in how the probabilities are calculated. If you read the linked conversation, IlyaShpitser brings up a medical example that should make things clearer.
What’s the difference between a and do(a)?
The English explanation is that P(O|a) is “the probability of outcome O given that we observe the action is a” and P(O|do(a)) is “the probability of outcome O given that we set the action to a.”
The first works by conditioning; basically, you go through the probability table, throw out all of the cases where the action isn’t a, and then renormalize.
The second works by severing the causal links that point into the modified node, while keeping the causal links that point out of it. Then you use this severed subgraph to calculate a new joint probability distribution, with the action fixed to a.
The practical difference shows up mostly in cases where some environmental variable influences the action. If you condition on observing a, you make a Bayesian update, which means your decision gets treated as evidence about unmeasured variables that could have influenced that decision (because correlation is symmetric). For example, suppose you’re uncertain how serious your illness is, but you know that seriousness of illness is positively correlated with going to the hospital. Then, as part of your decision whether or not to go to the hospital, your model tells you that going to the hospital would make your illness more serious, because it would make your illness seem more serious.
The defense of EDT is generally that of course the decision-maker would intuitively know which correlations are inside the correct reference class and which aren’t. This defense breaks down if you want to implement the decision-making as a computer algorithm, where programming in intuition is an open problem, or if you want to handle complicated interventions in complicated graphs, where intuition is not strong enough to reliably get the correct answer.
The benefit of do(a) is that it’s an algorithmic way of encoding asymmetric causality assumptions. The edge lesion → smoke means we think that learning about the lesion tells us something about whether someone will smoke, and that learning whether someone smoked tells us something about whether they have the lesion; but changing someone from a smoker to a non-smoker (or the other way around) will not affect whether they have the lesion, while directly changing whether someone has the lesion will change how likely they are to smoke. With the do() operator we can algorithmically construct the correct reference class for any given intervention into a causal network; that reference class is the severed subgraph I mentioned earlier.
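To make that asymmetry concrete, here is a minimal Python sketch of a two-node lesion → smoke model. The numbers are invented purely for illustration; the point is that conditioning on smoking moves our estimate of the lesion, do(smoking) does not, and do(lesion) still moves the smoking probability.

```python
# Toy lesion -> smoke model with invented numbers (illustration only).
P_LESION = 0.3                            # P(lesion = 1)
P_SMOKE1_GIVEN_LESION = {0: 0.2, 1: 0.8}  # P(smoke = 1 | lesion)

def p_joint(lesion, smoke):
    """Joint probability from the factorization P(lesion) * P(smoke | lesion)."""
    p_l = P_LESION if lesion else 1 - P_LESION
    p_s1 = P_SMOKE1_GIVEN_LESION[lesion]
    return p_l * (p_s1 if smoke else 1 - p_s1)

# Observing smoke = 1: throw out the smoke = 0 rows and renormalize.
p_smoke1 = sum(p_joint(l, 1) for l in (0, 1))
p_lesion_given_smoke1 = p_joint(1, 1) / p_smoke1

# Intervening with do(smoke = 1): the lesion -> smoke link is severed,
# so the lesion keeps its prior distribution.
p_lesion_given_do_smoke1 = P_LESION

# Intervening on the lesion: the lesion -> smoke link points *out* of the
# modified node and is kept, so smoking does change.
p_smoke1_given_do_lesion1 = P_SMOKE1_GIVEN_LESION[1]

print(p_lesion_given_smoke1)      # ~0.63: the observation is evidence about the lesion
print(p_lesion_given_do_smoke1)   # 0.30: setting smoking says nothing about the lesion
print(p_smoke1_given_do_lesion1)  # 0.80: setting the lesion changes smoking
```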
How about a more concrete example: what’s the difference between observing that I one-box and setting that I one-box?
P(A|B) = P(A&B)/P(B). That is the definition of conditional probability. You appear to be doing something else.
p(a | do(b)) = p(a) if b is not an ancestor of a in a causal graph.
p(a | do(b)) = sum{pa(b)} p(a | b, pa(b)) p(pa(b)) if b is an ancestor of a in a causal DAG (pa(b) denotes the parents/direct causes of b in that DAG). The idea is that p(b | pa(b)) represents how b varies based on its direct causes pa(b). An intervention do(b) tells b to ignore its causes and just take the value we set. So we drop p(b | pa(b)) from the factorization, and marginalize out everything except a (with b fixed at its set value). This is called “truncated factorization” or the “g-formula.”
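If it helps, that adjustment is only a few lines of Python. The graph (u → b, u → a, b → a, so pa(b) = {u}) and all the numbers below are hypothetical; the only point is that the interventional quantity reuses the observational pieces p(a | b, pa(b)) and p(pa(b)) but drops p(b | pa(b)).

```python
from itertools import product

# Hypothetical discrete model: u -> b, u -> a, b -> a, so pa(b) = {u}.
# All numbers are invented for illustration; every variable is binary.
P_U = {0: 0.6, 1: 0.4}
P_B1_GIVEN_U = {0: 0.2, 1: 0.7}                       # P(b = 1 | u)
P_A1_GIVEN_BU = {(0, 0): 0.1, (0, 1): 0.5,            # P(a = 1 | b, u)
                 (1, 0): 0.4, (1, 1): 0.9}

def p_joint(u, b, a):
    pb1 = P_B1_GIVEN_U[u]
    pa1 = P_A1_GIVEN_BU[(b, u)]
    return P_U[u] * (pb1 if b else 1 - pb1) * (pa1 if a else 1 - pa1)

def p_a1_given_b(b):
    """Ordinary conditioning: keep only the b slice and renormalize."""
    num = sum(p_joint(u, b, 1) for u in P_U)
    den = sum(p_joint(u, b, a) for u, a in product(P_U, (0, 1)))
    return num / den

def p_a1_given_do_b(b):
    """g-formula: drop p(b | pa(b)), keep p(a | b, pa(b)) * p(pa(b)), sum over pa(b)."""
    return sum(P_A1_GIVEN_BU[(b, u)] * P_U[u] for u in P_U)

print(p_a1_given_b(1), p_a1_given_do_b(1))  # 0.75 vs 0.6: the two come apart
```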
If your causal DAG has hidden variables, there is sometimes no way to express p(a | do(b)) as a function of the observed marginal, and sometimes there is. You can read my thesis, or Judea’s book for details if you are curious. For example if your causal DAG is:
b → c → a with a hidden common cause h of b and a, then
p(a | do(b)) = sum{c} p(c | b) sum{b’} p(a | c, b’) p(b’)
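For anyone who wants to check that expression numerically: the sketch below invents a full joint distribution over h, b, c, a for that graph, computes p(a | do(b)) from the full model (where we can see h), and then computes the formula above using only the observed marginal over (b, c, a). The graph structure comes from the comment above; every number is made up.

```python
from itertools import product

# Hypothetical numbers for b -> c -> a with a hidden common cause h of b and a.
# All variables are binary; every probability below is invented.
P_H1 = 0.4
P_B1_GIVEN_H = {0: 0.3, 1: 0.8}
P_C1_GIVEN_B = {0: 0.2, 1: 0.9}
P_A1_GIVEN_CH = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}

def bern(p1, x):
    return p1 if x else 1 - p1

def p_full(h, b, c, a):
    return (bern(P_H1, h) * bern(P_B1_GIVEN_H[h], b) *
            bern(P_C1_GIVEN_B[b], c) * bern(P_A1_GIVEN_CH[(c, h)], a))

def p_obs(b, c, a):
    """Observed marginal: h is hidden, so sum it out."""
    return sum(p_full(h, b, c, a) for h in (0, 1))

def p_a1_do_b_truth(b):
    """Ground truth from the full model: do(b) cuts the h -> b link."""
    return sum(bern(P_H1, h) * bern(P_C1_GIVEN_B[b], c) * P_A1_GIVEN_CH[(c, h)]
               for h, c in product((0, 1), (0, 1)))

def p_a1_do_b_frontdoor(b):
    """The formula above, computed from the observed marginal only."""
    p_b = {v: sum(p_obs(v, c, a) for c, a in product((0, 1), (0, 1))) for v in (0, 1)}
    p_c_given_b = lambda c, b_: sum(p_obs(b_, c, a) for a in (0, 1)) / p_b[b_]
    p_a1_given_cb = lambda c, b_: p_obs(b_, c, 1) / sum(p_obs(b_, c, a) for a in (0, 1))
    return sum(p_c_given_b(c, b) *
               sum(p_a1_given_cb(c, bp) * p_b[bp] for bp in (0, 1))
               for c in (0, 1))

for b in (0, 1):
    print(b, p_a1_do_b_truth(b), p_a1_do_b_frontdoor(b))  # the two columns agree
```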
If you forget about causality, and view the g-formula rules above as a statistical calculus, you get something interesting, but that’s a separate story :).
What is pa(X)?
It doesn’t look to me like you’re doing EDT with a causal model. It looks to me like you’re redefining | so that CDT is expressed with the symbols normally used to express EDT.
I am doing CDT. I wouldn’t dream of doing EDT because EDT is busted :).
In the wikipedia article on CDT:
http://en.wikipedia.org/wiki/Causal_decision_theory
p(A > Oj) is referring to p(Oj | do(A)).
The notation p(a | do(b)) is due to Pearl, and it does redefine what the conditioning bar means, although the notation is not really ambiguous.(*) You can also do things like p(a | do(b), c) = p(a,c | do(b)) / p(c | do(b)). Lauritzen writes p(a | do(b)) as p(a || b). Robins writes p(a | do(b)) as p(a | g = b) (actually Robins was first, so it’s more fair to say Pearl writes the latter as the former). The potential outcome people write p(a | do(b)) as p(A_b = a) or p(A(b) = a).
The point is, do(.) and conditioning aren’t the same.
(*) The problem with the do(.) notation is you cannot express things like p(A(b) | B = b’), which is known in some circles as “the effect of treatment on the (un)treated,” and more general kinds of counterfactuals, but this is a discussion for another time. I prefer the potential outcome notation myself.
The OP implied that EDT becomes CDT if a certain model is used.
What do you mean by “busted”? It lets you get $1,000,000 in Newcomb’s problem, which is $999,000 more than CDT gets you.
Yes. I think the OP is “wrong.” Or rather, the OP makes the distinction between EDT and CDT meaningless.
I mean that it doesn’t work properly, much like a stopped clock.
Wasn’t the OP saying that there wasn’t a distinction between EDT and CDT?
If you want to get money when you encounter Newcomb’s problem, you get more if you use EDT than CDT. Doesn’t this imply that EDT works better?
Sure, in the same sense that a stopped clock pointing to 12 is better than a running clock that is five minutes fast, when it is midnight.
From past comments on the subject by this user, it roughly translates to “CDT is rational. We evaluate decision theories based on whether they are rational. EDT does not produce the same results as CDT; therefore EDT is busted.”
“Busted” = “does the wrong thing.”
If this is what you got from my comments on EDT and CDT, you really haven’t been paying attention.
Without a specified causal graph for Newcomb’s, this is difficult to describe. (The difference is way easier to explain in non-Newcomb’s situations, I think, like the Smoker’s Lesion, where everyone agrees on the causal graph and the joint probability table.)
Suppose we adopt the graph Prediction ← Algorithm → Box, where you choose your algorithm, which perfectly determines both Omega’s Prediction and which Boxes you take. Omega reads your algorithm, fills the box accordingly, but then before you can make your choice Professor X comes along and takes control of you, which Omega did not predict. Professor X can force you to one-box or two-box, but that won’t adjust Omega’s prediction of you (and thus which boxes are filled). Professor X might realistically expect that he could make you two-box and receive all the money, whereas you could not expect that, because you know that two-boxing means that Omega would predict that you two-boxed.
(Notice that this is different from the interpretation in which Omega can see the future, which has a causal graph like Box → Prediction, in which case you cannot surprise Omega.)
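Under that hypothetical Prediction ← Algorithm → Box graph, the observation/intervention split can be made numeric. The prior below is invented; the predictor and your box choice are both taken to follow the algorithm exactly, as in the story above. Conditioning on a two-box choice drags the prediction along with it, while do(two-box), Professor X’s move, leaves the prediction at its prior.

```python
# Hypothetical numbers for Prediction <- Algorithm -> Box (1 = one-box, 0 = two-box).
P_ONEBOX_ALGO = 0.3   # invented prior on your algorithm being a one-boxer

def joint(algo, prediction, box):
    """Prediction and Box are both exact copies of the algorithm in this toy model."""
    p_algo = P_ONEBOX_ALGO if algo else 1 - P_ONEBOX_ALGO
    return p_algo * (1.0 if prediction == algo else 0.0) * (1.0 if box == algo else 0.0)

# Observing that you two-box (box = 0): the prediction updates along with you.
num = sum(joint(a, 0, 0) for a in (0, 1))
den = sum(joint(a, p, 0) for a in (0, 1) for p in (0, 1))
print(num / den)          # 1.0: if you two-box, Omega predicted two-boxing

# Professor X sets box = 0: the Algorithm -> Box link is severed, and the
# prediction keeps its prior distribution.
print(1 - P_ONEBOX_ALGO)  # 0.7: the chance Omega predicted two-boxing anyway
```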
That’s what I’m describing, but apparently not clearly enough. P(A&B) was what I meant by the ‘probability of A once we throw out all the cases where B doesn’t hold’, which is then renormalized by dividing by P(B).
So, do(x) refers to someone else making the decision for you? Newcomb’s problem doesn’t traditionally have a “let Professor X mind-control you” option.
In your case, you cannot surprise Omega either. Only Professor X can.
Generally, no. Newcomb’s is weird, and so examples using it will be weird.
It may be clearer to imagine a scenario where some node has a default value, which may depend on other variables in the system, and where you can intervene to change it from that default to some other value you prefer.
For example, suppose you had a button that toggles whether a fire alarm is ringing. Suppose the fire alarm is not perfectly reliable, so that sometimes it rings when there isn’t a fire, and sometimes when there is a fire it doesn’t ring. Observing that the alarm is off and then switching it on yourself is very different from observing that the alarm is on.
If an EDT system has only two nodes, “fire” (which is unobserved) and “alarm” (which is observed), then it doesn’t have a way to distinguish between the alarm switching on its own (when we should update our estimate of fire) and the alarm switching because we pressed the button (when we shouldn’t update our estimate of fire). We could fix that by adding in a “button” node, or by switching to a causal network where fire points to alarm but alarm doesn’t point to fire. In general, the second approach is better because it lacks degrees of freedom which it should not have (and because many graph-based techniques scale in complexity with the number of nodes, whereas making the edges directed generally reduces the complexity, I think). It’s also agnostic about how we intervene, which allows us to use one graph to contemplate many interventions, rather than having a clear-cut delineation between decision and nature nodes.
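Both fixes fit in a few lines. The numbers below are invented; the point is just that in the three-node evidential model, conditioning on the button explains the alarm away, and in the two-node causal model, do(alarm) severs the fire → alarm link, so either way P(fire) is left alone when we flip the alarm ourselves.

```python
# Invented numbers: fire is rare and the alarm is imperfect.
P_FIRE = 0.01
P_ALARM1_GIVEN_FIRE = {0: 0.05, 1: 0.9}  # false-positive / true-positive rates
P_BUTTON = 0.5                           # prior on us pressing the button

def bern(p1, x):
    return p1 if x else 1 - p1

# Fix 1: add a "button" node. Pressing the button forces the alarm on;
# otherwise the alarm follows the fire-driven distribution.
def p_alarm(alarm, fire, button):
    if button:
        return 1.0 if alarm else 0.0
    return bern(P_ALARM1_GIVEN_FIRE[fire], alarm)

def joint(fire, button, alarm):
    return bern(P_FIRE, fire) * bern(P_BUTTON, button) * p_alarm(alarm, fire, button)

def p_fire_given(alarm, button):
    num = joint(1, button, alarm)
    return num / (num + joint(0, button, alarm))

print(p_fire_given(alarm=1, button=0))  # ~0.15: an alarm going off on its own is evidence of fire
print(p_fire_given(alarm=1, button=1))  # 0.01: the alarm is on because we pressed the button

# Fix 2: keep only fire -> alarm and use do(). Intervening on the alarm severs
# its incoming link, so the fire marginal is untouched.
def p_fire_given_do_alarm(alarm):
    return P_FIRE

print(p_fire_given_do_alarm(1))         # 0.01, same as the prior
```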
Right; I meant to convey that in the Omega sees the future case, not even Professor X can surprise Omega.
Hopefully, you can tell the difference between an alarm you triggered and an alarm that you did not.
I can, and you can, but imagine that we’re trying to program a robot to make decisions in our place, and we can’t trust the robot to have our intuition.* Suppose we give it a utility function that prefers there not being a fire to there being a fire, but don’t give it control over its epistemology (so it can’t just alter its beliefs so it never believes in fires).
If we program it to choose actions which maximize P(O|a) in the two-node system, it’ll shut off the alarm in the hope that doing so will make a fire less likely. If we program it to choose actions which maximize P(O|do(a)), it won’t make that mistake. (There’s a small numerical sketch of this comparison after the footnote below.)
* People have built-in decision theories for simple problems, and so it often seems strange to demo decision theories on problems small enough that the answer is obvious. But a major point of mathematical decision theories is to enable algorithmic computation of the correct decision in very complicated systems. Medical diagnosis causal graphs can have hundreds, if not thousands, of nodes, and the impact on the network of adjusting some variables might be totally nonobvious. Maybe treating some symptoms has no effect on the progress of the disorder, whereas treating other symptoms does; and there might be symptoms whose treatment makes it slightly more likely that the disorder will be cured but significantly less likely that we can tell whether it has been cured, and calculating whether or not that tradeoff is worth it is potentially very complicated.
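Here is roughly what that comparison looks like in code for the two-node fire → alarm model, with invented probabilities and utilities. The rule built on P(O|a) treats its own action as evidence and switches the alarm off to make fire look less likely; the rule built on P(O|do(a)) sees that neither setting changes the chance of fire and is indifferent.

```python
# Invented two-node model: fire -> alarm. The robot can set the alarm off (0) or on (1).
P_FIRE = 0.01
P_ALARM1_GIVEN_FIRE = {0: 0.05, 1: 0.9}
UTILITY = {0: 0.0, 1: -100.0}            # no fire is strongly preferred to fire

def bern(p1, x):
    return p1 if x else 1 - p1

def p_fire_given_alarm(alarm):
    """Evidential rule: treat the chosen alarm state as an observation and update."""
    num = P_FIRE * bern(P_ALARM1_GIVEN_FIRE[1], alarm)
    den = num + (1 - P_FIRE) * bern(P_ALARM1_GIVEN_FIRE[0], alarm)
    return num / den

def p_fire_given_do_alarm(alarm):
    """Causal rule: setting the alarm severs fire -> alarm, so P(fire) is unchanged."""
    return P_FIRE

def expected_utility(p_fire):
    return p_fire * UTILITY[1] + (1 - p_fire) * UTILITY[0]

for name, rule in [("P(fire | a)    ", p_fire_given_alarm),
                   ("P(fire | do(a))", p_fire_given_do_alarm)]:
    utilities = {a: expected_utility(rule(a)) for a in (0, 1)}
    print(name, utilities)
# The first rule strictly prefers alarm = 0 ("shut off the alarm");
# under the second rule the two actions tie, so the alarm setting is irrelevant to fire.
```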
A robot would always be able to tell whether it’s an alarm it triggered. Humans are the ones that are bad at it. Did you actually decide to smoke because EDT is broken, or are you just justifying it that way while actually doing it because you have the smoking lesion?
Once it knows its sensor readings, knowing whether or not it triggers the alarm is no further evidence for or against a fire.