A causal model to me is a set of joint distributions defined over potential outcome random variables.
Huh?
Can you expand on this, with special attention to the difference between the model and the result of a model, and to the differences from plain-vanilla Bayesian models, which will also produce joint distributions over outcomes.
Sure. Here’s the world’s simplest causal graph: A → B.
Rubin et al., who do not like graphs, will instead talk about a joint distribution:
p(A, B(a=1), B(a=0))
where B(a=1) means ‘random variable B under intervention do(a=1)’. Assume binary A for simplicity here.
A causal model over A,B is a set of densities { p(A, B(a=1), B(a=0)) | [ some property ] }. The causal model for this graph would be:
{ p(A, B(a=1), B(a=0)) | B(a=1) is independent of A, and B(a=0) is independent of A }
These assumptions are called ‘ignorability assumptions’ in the literature, and they correspond to the absence of confounding between A and B. Note that it took counterfactuals to define what ‘absence of confounding’ means.
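To see what ignorability buys you, here is a minimal simulation sketch of this model (the numbers and variable names are invented for illustration): when A is assigned independently of the potential outcomes, the observed contrast E[B | A=1] − E[B | A=0] recovers the causal contrast E[B(1)] − E[B(0)]; when A depends on them, it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Ignorable case: A is assigned independently of the potential outcomes.
b0 = rng.binomial(1, 0.3, n)      # potential outcome B(a=0)
b1 = rng.binomial(1, 0.7, n)      # potential outcome B(a=1)
a = rng.binomial(1, 0.5, n)       # treatment, independent of (b0, b1)
b_obs = np.where(a == 1, b1, b0)  # consistency: we observe B(a) for the a received

print(b_obs[a == 1].mean() - b_obs[a == 0].mean())  # ~0.4, observed contrast
print(b1.mean() - b0.mean())                        # ~0.4, causal contrast E[B(1)] - E[B(0)]

# Confounded case: A now depends on B(1), so ignorability fails
# and the observed contrast is biased.
a_c = rng.binomial(1, np.where(b1 == 1, 0.8, 0.2))
b_obs_c = np.where(a_c == 1, b1, b0)
print(b_obs_c[a_c == 1].mean() - b_obs_c[a_c == 0].mean())  # ~0.6, not 0.4
```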
A regular Bayesian network model for this graph is just the set of densities over A and B (since this graph has no d-separation statements). That is, it is the set { p(A,B) | [no assumptions] }. This is a ‘statistical model,’ because it is a set of regular old joint densities, with no mention of counterfactuals or interventions anywhere.
The same graph can correspond to very different things; you have to specify which.
You could also have assumptions corresponding to “missing graph edges.” For example, in the instrumental variable graph:
Z → A → B, with A ← U → B, where we do not see U. Here we would have an assumption stating that B(a,z) = B(a,z’) for all a, z, z’.
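A sketch of what that missing-edge assumption looks like as code (functional forms invented for illustration): if B’s structural equation simply never takes z as an argument, then B(a,z) = B(a,z’) holds by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.binomial(1, 0.5, n)   # unobserved confounder U
eps = rng.uniform(size=n)     # exogenous noise for B, held fixed across interventions

def B(a, z, u, eps):
    # z is accepted but never used: this encodes B(a, z) = B(a, z') for all z, z'.
    return (eps < 0.2 + 0.5 * a + 0.2 * u).astype(int)

print(np.array_equal(B(1, 0, u, eps), B(1, 1, u, eps)))  # True: the exclusion restriction
```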
Please don’t say “Bayesian model” when you mean “Bayesian network.” People really should say “belief networks” or “statistical DAG models” to avoid confusion.
> Please don’t say “Bayesian model” when you mean “Bayesian network.”
I do not mean “Bayesian networks”. I mean Bayesian models of the kind e.g. described in Gelman’s Bayesian Data Analysis.
> p(A, B(a=1), B(a=0)) where B(a=1) means ‘random variable B under intervention do(a=1)’. Assume binary A for simplicity here.
You still can express this as plain-vanilla conditional densities, can’t you? “under intervention do(a=1)” is just a different way of saying “conditional on A=1”, no?
> A causal model over A,B is a set of densities { p(A, B(a=1), B(a=0)) | [ some property ] }
and
> with no mention of counterfactuals or interventions anywhere.
I don’t see counterfactuals in your set of densities, and how are “interventions” different from conditioning?
> You still can express this as plain-vanilla conditional densities, can’t you?
No. If conditioning were the same as intervening, I could make it rain by watering my lawn and become a world-class athlete by putting on a gold medal.
I don’t understand—can you unroll?
Well, since p(rain | grass wet) is high, it seems making the grass wet via a garden hose will make rain more likely. Of course you might say that “making the grass wet” and “seeing the grass wet” are not the same thing, in which case I agree!
The fact that these are not the same thing is why people say conditioning and interventions are not the same thing.
You can of course say that you can still use the language of conditional probability to talk about “doing events” vs “seeing events.” But then you are just reinventing interventions (as will become apparent if you try to figure out axioms for your notation).
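To unroll this numerically, here is a tiny enumeration sketch of the two-variable net Rain → Grass (the probabilities are made up for illustration): conditioning on wet grass raises the probability of rain, while intervening on the grass leaves it untouched.

```python
p_rain = 0.2                      # marginal P(rain)
p_wet = {True: 0.9, False: 0.1}   # P(grass wet | rain)

# Seeing: P(rain | grass wet), by Bayes' rule. Wet grass is evidence of rain.
p_wet_marginal = p_rain * p_wet[True] + (1 - p_rain) * p_wet[False]
print(p_rain * p_wet[True] / p_wet_marginal)   # ~0.692

# Doing: do(grass wet) replaces the grass mechanism with the constant 'wet',
# cutting the rain -> grass arrow, so rain keeps its marginal probability.
print(p_rain)                                  # 0.2: hosing the lawn does not cause rain
```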
> Well, since p(rain | grass wet) is high, it seems making the grass wet via a garden hose will make rain more likely.
That’s a strawman. The conditional probability we’re talking about has a clear (if not explicitly stated) temporal ordering: P(rain in the past | wet grass in the present).
> But then you are just reinventing interventions
Talking about conditional probability was widespread long before people started talking about interventions.
It seems to me that the language of interventions, etc. is just a formalism that is convenient for certain types of analysis, but I’m not seeing that it means anything new.
> That’s a strawman. The conditional probability we’re talking about has a clear (if not explicitly stated) temporal ordering: P(rain in the past | wet grass in the present).
You seem to be missing Ilya’s point. He was arguing that if you regard “under intervention do(A = 1)” as equivalent to “conditional on A = 1” (as you suggested in a previous comment), then you should regard P(rain | do(grass wet)) as equivalent to P(rain | grass wet). But these are not in fact equivalent, and adding temporal ordering in there doesn’t make them equivalent either. P(rain in the past | do(wet grass) in the present) = P(rain in the past), but P(rain in the past | wet grass in the present) != P(rain in the past).
> He was arguing that if you regard “under intervention do(A = 1)” as equivalent to “conditional on A = 1” (as you suggested in a previous comment), then you should regard P(rain | do(grass wet)) as equivalent to P(rain | grass wet).
No, because they’re modeling different realities. There is obviously a difference between observational data and experiments.
> There is obviously a difference between observational data and experiments.
Yes! The difference is that experiments involve intervention. I thought the necessity of formalizing the notion of intervention is precisely what was under dispute here.
Well, kinda. I am not sure whether the final output—the joint densities of outcomes—will be different in a causal model compared to a properly specified conventional model.
To continue with the same example, it suffers from the expression “wet grass” meaning two different things—either “I see wet grass” or “I made the grass wet”. This is your difference between plain (a=1) and do(a=1), but conventional non-causal modeling doesn’t have huge problems with this; it is fully aware of the difference.
And I don’t know if it’s necessary to formalize intervention. I freely concede that it’s useful in certain areas, but I’m not so sure that’s true for all areas.
> Well, kinda. I am not sure whether the final output—the joint densities of outcomes—will be different in a causal model compared to a properly specified conventional model.
So, we could add a node to the graph for every single node, which corresponds to whether or not that node was the subject of an intervention. So you would talk about P(rain | grass is wet, ~I made it rain, ~I made the grass wet) vs. P(rain | grass is wet, ~I made it rain, I made the grass wet). But this means doubling the number of nodes in the graph (which, since the number of joint probabilities is exponential in the number of nodes for discrete variables, is a terrible idea). You also might want to throw in a lot of consistency constraints which are not guaranteed to hold in an arbitrary graph, which makes things more awkward.
It is much simpler, conceptually and practically, to just have a rule to determine how interventions differ from observations in updating the state of the graph, that is, talking about P(rain|grass is wet) vs. P(rain|do(grass is wet)).
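A sketch of that intervention-indicator encoding on the same rain/grass example (the names and numbers here are mine, not from the thread): the indicator overrides the grass mechanism, and the two conditional queries come apart exactly as the do(...) notation says they should.

```python
p_rain = 0.2

def p_wet_given(rain, intervened):
    # Grass mechanism in the augmented model: when the intervention
    # indicator is set, grass is wet regardless of rain.
    if intervened:
        return 1.0
    return 0.9 if rain else 0.1

def p_rain_given_wet(intervened):
    # P(rain | grass wet, indicator) by Bayes' rule; the indicator is
    # exogenous, so it is independent of rain a priori.
    num = p_rain * p_wet_given(True, intervened)
    den = num + (1 - p_rain) * p_wet_given(False, intervened)
    return num / den

print(p_rain_given_wet(False))  # ~0.692: observed wet grass is evidence of rain
print(p_rain_given_wet(True))   # 0.2: forcing the grass wet is no evidence at all
```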
> So, we could add a node to the graph for every single node, which corresponds to whether or not that node was the subject of an intervention.
In fact, Phil Dawid does precisely this. What he ends up with is still interventions. (Of course he (I think!) does not believe in counterfactuals, but that is a long discussion.)
> So, we could add a node to the graph for every single node
That assumes we’re doing graphs and networks.
My problems in this subthread really started when the causal model was defined as “a set of joint distributions defined over potential outcome random variables”—notice how nothing like networks or interventions is mentioned here—and I got curious why a plain-vanilla Bayesian model, which also produces a set of joint distributions, doesn’t qualify.
It probably just was a bad definition.
Sorry, this is a response to an old comment, but this is an easy question to clarify.
A potential outcome Y(a) is a random variable under an intervention, e.g. Y under do(a). It’s just a different notation from a different branch of statistics.
We may or may not choose to use graphs to represent causality (or indeed probability). Some people like graphs, others do not. Graphs do not add anything; they are just a visual representation.
I agree with pragmatist’s explanation. But let me add a bit more detail to illustrate that a temporal ordering will not save you here. Imagine instead of two variables we have three: rain (R), my grass being wet (G1), and my neighbor’s grass being wet (G2). Clearly R precedes both G1 and G2, and G1 and G2 are contemporaneous. In fact, we can even consider G2 to be my neighbor’s grass 1 hour in the future (so clearly G1 precedes G2!).
Also clearly, p(R = yes | G1 = wet) is high and p(R = yes | G2 = wet) is high; likewise, p(G1 = wet | R = yes) is high and p(G2 = wet | R = yes) is high.
So by hosing my grass I am making it more likely that my neighbor’s grass one hour from now will be wet?
Or, to be more succinct: http://www.smbc-comics.com/index.php?db=comics&id=1994#comic
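For concreteness, the same enumeration trick on this three-variable example (numbers invented): R → G1 and R → G2 with no edge between G1 and G2, so G1 predicts G2 observationally, but intervening on G1 does nothing to G2.

```python
p_r = 0.2
p_g = {True: 0.9, False: 0.1}   # P(G = wet | R), same mechanism for G1 and G2

# Seeing: P(G2 = wet | G1 = wet), summing over R.
num = p_r * p_g[True] ** 2 + (1 - p_r) * p_g[False] ** 2
den = p_r * p_g[True] + (1 - p_r) * p_g[False]
print(num / den)   # ~0.654: my wet grass predicts my neighbor's wet grass

# Doing: do(G1 = wet) cuts R -> G1, so R (and hence G2) keeps its marginal.
print(p_r * p_g[True] + (1 - p_r) * p_g[False])   # 0.26, unchanged by the hose
```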
Yeah, well, I’ve heard somewhere that correlation does not equal causation :-)
I agree that causal models are useful—if only because they make explicit certain relationships which are implicit in plain-vanilla regular models and so trip up people on a regular basis. What I’m not convinced of is that you can’t re-express that joint density on the outcomes in a conventional way, even if it turns out to look a bit awkward.
Here’s how this conversation played out.
Lumifer: “can we not express cause-effect relationships via conditioning probabilities?”
me: “No: [example].”
Lumifer: “Ah, but this is silly because of time ordering information.”
me: “Time ordering doesn’t matter: [slight modification of example].”
Lumifer: “Yeah… causal models are useful, but it’s not clear they cannot be expressed via conditioning probabilities.”
I guess you can lead a horse to water, but you can’t make him drink. I have given you everything; all you have to do is update and move on. Or not, it’s up to you.
Yes, I’m a picky sort of a horse :-) Thanks for the effort, though.