I really enjoyed this, and I agree that it belongs in Main. I’ll need to read it a few more times before I understand it well. As I was reading, I tried to fit these ideas in with other concepts I already had. I want to make sure I’m doing this in a way that makes sense, and in particular that I’m not missing any subtleties:
I’m imagining the ‘data generating mechanism’ as the computation that produced our joint distribution. To actually make inferences about it, you have to use some a priori causal information. To me, this looks a lot like Solomonoff induction: you are trying to find the simplest computation that could have produced the joint distribution (e.g., ‘Which variable was assigned first?’ could be answered by ‘Look at all the programs that could have computed the distribution, find the simplest one, and see which variable is assigned first in that program’). Does this sound right? Are there other aspects of the a priori causal information that this doesn’t capture?
The positivity assumption is like saying that you have no prior knowledge of any (significant) differences between the two groups: for each individual, your state of information as to which group they end up in is exactly the same. So a coin with a known bias, or a perfectly deterministic pseudorandom number generator, is fine to use, and we can handle a ‘random violation of positivity’ just fine. A structural violation, on the other hand, is bad because either we did have some information, or our priors were really off. Does that sound about right?
Thanks!
re: 1: a data generating process is indeed a machine that produces the data we see. But, importantly, it also contains information about all sorts of other things, in particular what sort of data would come out had you given this machine completely different inputs. The fact that this machine encodes this sort of counterfactual information is what makes it “in the causal magisterium,” so to speak.
The machine itself is presumed to be “out there.” If we are trying to learn what that machine might be, we may wish to invoke assumptions akin to Occam’s razor. But this is about us trying to learn something, not about the machine per se. Nature is not required to be convenient!
To use an analogy our mutual friend Judea likes to use, ‘the joint distribution’ is akin to encoding how some object, say a vase, reflects light from every angle. This information is sufficient to render the vase in a computer graphics system. But it is not sufficient to render what happens if we smash the vase: for that we need, in addition to the surface information, information about the material of the vase, how brittle it is, and so on. The ‘data generating process’ would contain, in addition to surface information about light reflectivity, information that lets us deduce how the vase would react to any counterfactual deformation we might perform, whether we drop it from a table, smash it with a hammer, or lightly nudge it.
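The analogy can be made concrete with a toy sketch (Python; the variable names and probabilities below are entirely made up for illustration): a data generating process is a program you can re-run under interventions, whereas sampling it only ever exposes the joint distribution, the vase’s “surface.”

```python
import random

random.seed(0)  # for reproducibility

# A tiny, invented data generating machine: smoking -> tar -> cancer.
def mechanism(smoking=None):
    """Run the machine once; optionally intervene by forcing `smoking`."""
    if smoking is None:
        smoking = random.random() < 0.3           # natural assignment
    tar = smoking and random.random() < 0.9       # tar deposits
    cancer = random.random() < (0.2 if tar else 0.02)
    return smoking, tar, cancer

# Repeatedly sampling mechanism() recovers the joint distribution.
# But the machine itself also answers interventional queries that
# the joint distribution alone does not determine:
samples = [mechanism(smoking=True)[2] for _ in range(10_000)]
rate = sum(samples) / len(samples)
print(rate)  # approximates P(cancer | do(smoking)) = 0.9*0.2 + 0.1*0.02 = 0.182
```

Two different machines can produce the same joint distribution while giving different answers to the `do(...)` query, which is why the joint distribution alone is not enough.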
Thanks for your reply.
At least the way I am conceiving of a computation, I can (theoretically) run the same computation with different inputs. So I think a computation would capture that sort of counterfactual information as well.
So in LW terms—beware of the mind projection fallacy.
Thank you!
I am not sure your understanding of positivity is right: it is simply a matter of whether there exists any stratum in which nobody is treated (or nobody is untreated).
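Stated that way, the condition can be checked mechanically. A minimal sketch (Python; the strata and records are hypothetical):

```python
from collections import defaultdict

# Hypothetical records: (stratum, treated) pairs.
records = [
    ("young", True), ("young", False),
    ("old", True),   ("old", True),   # nobody is untreated in "old"
]

def positivity_violations(records):
    """Return the strata in which everybody (or nobody) is treated."""
    seen = defaultdict(set)
    for stratum, treated in records:
        seen[stratum].add(treated)
    return [s for s, vals in seen.items() if vals != {True, False}]

print(positivity_violations(records))  # ['old']
```

A check on finite data can of course only detect apparent violations; whether a violation is structural is a property of the underlying assignment probabilities, not of the sample.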
The distinction between “random” and “structural” violations was inherited from a course that did not insist as strongly on distinguishing between statistics and causal inference. I think the assumption necessary for identification is structural positivity, and that a “random violation of positivity” is merely an apparent violation of the assumption due to sampling. I will update the text to make this clear.
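That distinction can be illustrated with a small simulation (Python; the strata and probabilities are invented): a random violation vanishes as the sample grows, while a structural zero persists at any sample size.

```python
import random

random.seed(1)

# Underlying (structural) treatment probabilities per stratum.
# "old" has a structural violation: its treatment probability is exactly 0.
p_treat = {"young": 0.5, "old": 0.0}

def sample(n):
    strata = random.choices(["young", "old"], k=n)
    return [(s, random.random() < p_treat[s]) for s in strata]

def apparent_violations(data):
    seen = {}
    for s, t in data:
        seen.setdefault(s, set()).add(t)
    return sorted(s for s, vals in seen.items() if vals != {True, False})

print(apparent_violations(sample(5)))        # may flag "young" purely by chance
print(apparent_violations(sample(100_000)))  # only the structural zero survives
```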
I’ll keep thinking about your question 1. Possibly there are other people who would be better suited to answer it than I am.