In the smoking lesion variant where smoking is actually protective against cancer, but not enough to overcome the damage done by the lesion …
I tend to be sceptical of smoking lesion arguments on account of how the scenario always seems to be either underspecified or contradictory. For example, how can any agents in the smoking lesion problem be EDT agents at all?
If they always take the action recommended by EDT, and there is exactly one such action, then they must all take the same action. But in that case there can’t possibly be the postulated connection between the lesion and smoking (conditional on being an EDT agent). So an EDT agent that knows it implements EDT can’t believe that its decision to smoke affects the chances of having the lesion, on pain of making incorrect predictions.
On the other hand, if “EDT agents” in this problem only sometimes take the action recommended by EDT, and the rest of the time are somehow influenced by the presence or absence of the lesion, then the description of the problem that says that the node controlled by your decision theory is “decision to smoke” would seem to be wrong to begin with. (These EDT agents will predict that P(I smoke | I smoke) = 1 and be horribly surprised.)
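The first horn of this argument can be checked with a few lines of arithmetic. The sketch below is only an illustration: the lesion prevalence and the smoking rates are made-up numbers, and `lesion_given_smoke` is a hypothetical helper, not anything from the problem statement. It shows that once every agent in the reference class takes the same action, observing that action tells you nothing about the lesion.

```python
from fractions import Fraction


def lesion_given_smoke(p_lesion, p_smoke_if_lesion, p_smoke_if_no_lesion):
    """P(lesion | smoke) by Bayes' rule for a binary lesion."""
    with_lesion = p_lesion * p_smoke_if_lesion
    without_lesion = (1 - p_lesion) * p_smoke_if_no_lesion
    return with_lesion / (with_lesion + without_lesion)


# Assumed prevalence of the lesion (hypothetical number).
p_lesion = Fraction(1, 5)

# General population: the lesion makes smoking much more likely (hypothetical rates).
general = lesion_given_smoke(p_lesion, Fraction(9, 10), Fraction(1, 10))

# Agents who all follow the same deterministic recommendation (here: smoke),
# with or without the lesion.
edt_only = lesion_given_smoke(p_lesion, Fraction(1), Fraction(1))
```

Under these assumptions `general` is well above the prior, while `edt_only` is exactly the prior: conditional on being such an agent, the lesion and the decision are independent.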
This is correct. You can remove the causality from a causal network and just use EDT on a joint probability distribution at the cost of increasing the number of nodes and the fan-in for each node. Since the memory requirements are exponential in fan-in and linear in number of nodes, this is a bad idea.
This is something I can believe, though it is not a correctness argument. Certainly it’s plausible that in many scenarios it is computationally more convenient to apply CDT directly than to use a fully general model that has been taught about the same structure that CDT assumes.
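As a rough illustration of the memory claim being quoted, here is a minimal sketch that counts table entries, assuming every node is binary and each node stores one free parameter per configuration of its parents. The function names are hypothetical, and the 20-node networks are invented purely for scale.

```python
def network_parameters(fan_in_per_node):
    """Free parameters in a binary network: 2**k per node with k parents."""
    return sum(2 ** k for k in fan_in_per_node)


def full_joint_parameters(n_nodes):
    """Free parameters in an unstructured joint table over n binary variables."""
    return 2 ** n_nodes - 1


# 20 nodes, each with 2 parents: cost grows linearly with the node count.
sparse = network_parameters([2] * 20)

# Same node count, but fan-in raised to 10: cost grows exponentially in fan-in.
dense = network_parameters([10] * 20)

# No structure at all: a single table over all 20 variables.
unstructured = full_joint_parameters(20)
```

This only illustrates the scaling; it says nothing about which decision theory gives the right answer.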
For example, how can any agents in the smoking lesion problem be EDT agents at all?
In the statement of the smoking lesion problem I prefer, you have lots of observational data on people whose decision theory is unknown, but whose bodies are similar enough to yours that you think the things that give or don’t give them cancer will have the same effect on you. You also don’t know whether or not you have the lesion; a sensible prior is the population prevalence of the lesion.
Now it looks like we have a few options.
1. We only condition on data that’s narrowly similar. Here, that might mean only conditioning on other agents who use EDT, which would result in us having no data!
2. We condition on data that’s broadly similar, keeping the original correlations.
3. We condition on data that’s broadly similar, but try to break some of the original correlations.
Option 1 is unworkable. Option 2 is what I call ‘standard EDT,’ and it fails on the smoking lesion. Option 3 is generally the one EDTers use to rescue EDT from the smoking lesion. But the issue is that EDT gives you no guidance on which of the correlations to break; you have to figure it out from the problem description. One might expect that sitting down and working out whether or not to smoke using math breaks the correlation between smoking and having the lesion, as most people don’t do that. But should we also break the negative correlation between smoking and cancer conditional on lesion status? From the English names, we can probably get those right. If they’re unlabeled columns in a matrix or nodes in a graph, we’ll have trouble.
That work still has to be done somewhere, obviously; in CDT it’s done when one condenses the problem statement down to a causal network. (And CDTers historically being wrong on Newcomb’s is an example of what doing this work wrong looks like.) But putting work where it belongs and having good interfaces between your modules is a good idea, and I think this is a place where CDT does solidly better than EDT.
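The difference between keeping the original correlations and breaking the lesion-smoking one can be worked through numerically. The sketch below uses invented probabilities for the variant where smoking is mildly protective at every lesion status; all names and numbers are assumptions chosen for illustration.

```python
from fractions import Fraction as F

# Hypothetical numbers: smoking lowers cancer risk within each lesion status,
# but the lesion both raises cancer risk and makes smoking more likely.
P_LESION = F(1, 2)
P_SMOKE = {True: F(9, 10), False: F(1, 10)}   # P(smoke | lesion)
P_CANCER = {                                   # P(cancer | lesion, smoke)
    (True, True): F(7, 10), (True, False): F(8, 10),
    (False, True): F(1, 20), (False, False): F(1, 10),
}


def p_lesion(lesion):
    return P_LESION if lesion else 1 - P_LESION


def p_smoke(smoke, lesion):
    return P_SMOKE[lesion] if smoke else 1 - P_SMOKE[lesion]


def cancer_given_observed(smoke):
    """Keep all correlations: plain conditional P(cancer | smoke)."""
    num = sum(p_lesion(l) * p_smoke(smoke, l) * P_CANCER[(l, smoke)]
              for l in (True, False))
    den = sum(p_lesion(l) * p_smoke(smoke, l) for l in (True, False))
    return num / den


def cancer_given_broken_link(smoke):
    """Break the lesion-smoking correlation: the lesion keeps its prior."""
    return sum(p_lesion(l) * P_CANCER[(l, smoke)] for l in (True, False))
```

With these numbers the plain conditional makes smoking look harmful, while the version with the lesion held at its prior makes it look protective. Nothing in the arithmetic says which correlation to break; that choice was made by hand when writing `cancer_given_broken_link`.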
Certainly it’s plausible that in many scenarios it is computationally more convenient to apply CDT directly than to use a fully general model that has been taught about the same structure that CDT assumes.
I do think the linked Graham article is well worth reading; that all languages necessarily turn into machine code does not mean all languages are equally good for thinking in. Thinking in a more powerful language lets you have more powerful thoughts.
The smoking lesion is a problem with a logical contradiction in it. The decision is simultaneously a consequence of the lesion and of the decision theory’s output (but not one of its inputs, such as the desire to smoke, in which case it’s this desire that will correlate, and conditional on that desire, the decision itself won’t).
Edit: the smoking lesion problem seems more interesting from a psychological perspective. Perhaps it is difficult to detect internal contradictions within a hypothetical that asserts an untruth: any “this smells fishy” feeling is misattributed to the tension between the fact that smoking kills and the hypothetical genetics.
It could, thus, be very useful to come up with a real world example instead of using such hypotheticals.
In traditional decision theory as proposed by Bayesians such as Jaynes, you always condition on all observed data. The thing that tells you whether any of this observed data is actually relevant is your model, and it does this by outputting a joint probability distribution for your situation conditional on all that data. (What I mean by “model” here is expressed in the language of probability as a prior joint distribution P(your situation × dataset | model), or equivalently a conditional distribution P(your situation | dataset, model) if you don’t care about computing the prior probabilities of your data.)
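Spelled out, the link between the two forms is just the definition of conditional probability. The symbols s (your situation), d (the dataset) and M (the model) are shorthand introduced here, and the relation assumes the data has non-zero probability under the model:

```latex
P(s \mid d, M) \;=\; \frac{P(s, d \mid M)}{P(d \mid M)},
\qquad
P(d \mid M) \;=\; \sum_{s'} P(s', d \mid M) \;>\; 0 .
```

The denominator is the prior probability of the data, which is the part you can skip if you only ever need the conditional.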
Option 2 is what I call “blindly importing related historical data as if it was a true description of your situation”. Clearly any model that says that the joint probability for your situation is identically equal to the empirical frequencies in any random data set is wrong.
From the English names, we can probably get those right. If they’re unlabeled columns in a matrix or nodes in a graph, we’ll have trouble.
The point is, it’s not about figuring stuff out from English names. It’s about having a model that correctly generalises from observed data to predictions. Unlabeled columns in a matrix are no trouble at all if your model relates them to the nodes in your personal situation in the right way.
The CDT solution of turning the problem into a causal graph and calculating probabilities with do(·) is effectively just such a model, that admittedly happens to be an elegant and convenient one. Here the information that allows you to generalise from observed data to make personal predictions is introduced when you use your human intelligence to figure out a causal graph for the situation.
Still, none of this addresses the issue that the problem itself is underspecified.
ETA: Lest you think I’ve just said that CDT is better than EDT, the point I’m trying to make here is that if you want a decision theory to generalise from data, you need to provide a model. “Your situation has the same probabilities as a causal intervention on this causal graph on that dataset, where nodes {A, B, C, …} match up to nodes {X, Y, Z, …}” is as good a model as any, and can certainly be used in EDT. The fact that EDT doesn’t come “model included” is a feature, not a bug.
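A minimal sketch of that point, assuming hypothetical numbers and utilities: the EDT step (maximise expected utility conditional on the action) stays fixed, and only the supplied model P(cancer | action) changes. `NAIVE_MODEL`, `CAUSAL_MODEL`, `UTILITY` and `edt_choice` are illustrative names, not an established API.

```python
# Hypothetical P(cancer | action) under two different models of the same data.
NAIVE_MODEL = {"smoke": 0.635, "abstain": 0.17}    # raw observational frequencies
CAUSAL_MODEL = {"smoke": 0.375, "abstain": 0.45}   # lesion held at its prior

# Assumed utilities; only the ordering matters for this example.
UTILITY = {"cancer": -100.0, "healthy": 0.0}


def expected_utility(action, p_cancer_given_action):
    """Expected utility of an action under a model of P(cancer | action)."""
    p = p_cancer_given_action[action]
    return p * UTILITY["cancer"] + (1 - p) * UTILITY["healthy"]


def edt_choice(p_cancer_given_action):
    """EDT step: pick the action with the highest conditional expected utility."""
    return max(p_cancer_given_action,
               key=lambda a: expected_utility(a, p_cancer_given_action))
```

Here `edt_choice(NAIVE_MODEL)` abstains and `edt_choice(CAUSAL_MODEL)` smokes, with the same decision rule in both cases. The disagreement lives entirely in the model that was handed in.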
Option 2 is what I call “blindly importing related historical data as if it was a true description of your situation”. Clearly any model that says that the joint probability for your situation is identically equal to the empirical frequencies in any random data set is wrong.
Agreed that this is a bad idea. I think where we disagree is that I don’t see EDT as discouraging this. It doesn’t even throw a type error when you give it blindly imported related historical data! CDT encourages you to actually think about causality before making any decisions.
It’s about having a model that correctly generalises from observed data to predictions.
Note that decision theory does actually serve a slightly different role from a general prediction module, because it should be built specifically for counterfactual reasoning. The five-and-ten argument seems to be an example of this: if while observing another agent, you see them choose $5 over $10, it could be reasonable to update towards them preferring $5 to $10. If considering the hypothetical situation where you choose $5 instead of $10, it does not make sense to update towards yourself preferring $5 to $10, or to draw whatever conclusion you like by the principle of explosion.
that admittedly happens to be an elegant and convenient one.
Given that you can emulate one system using the other, I think that elegance and convenience are the criteria we should use to choose between them. Note that emulating a joint probability without causal knowledge using a causal network is trivial (you just use undirected edges for any correlations), but emulating a causal network using a joint probability is difficult.
“Your situation has the same probabilities as a causal intervention on this causal graph on that dataset, where nodes {A, B, C, …} match up to nodes {X, Y, Z, …}” is as good a model as any, and can certainly be used in EDT. The fact that EDT doesn’t come “model included” is a feature, not a bug.
Precisely.
Imagine, instead of the smoking lesion, a “death paradox lesion”. Statistical analysis has shown that this lesion is associated with early death, and also that it is correlated with the ability of the agent to make correct logical decisions.
Assume you don’t want an early death. Should you conclude that you have a death paradox lesion?
There’s also the scenario involving the EDT paradox lesion. This lesion is 1) correlated with early death, and 2) correlated with people’s use of EDT in the same way that the smoking lesion is correlated with smoking. What do you conclude and why?
I don’t understand most of your position on EDT/CDT, but I especially don’t understand how
follows from the previous sentence.
I also thought P(A|A)=1 followed from the axioms of probability.
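For reference, the identity does follow directly from the usual definition of conditional probability, as long as A has non-zero probability:

```latex
P(A \mid A) \;=\; \frac{P(A \cap A)}{P(A)} \;=\; \frac{P(A)}{P(A)} \;=\; 1,
\qquad \text{provided } P(A) > 0 .
```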