1) This is fantastic! I keep meaning to read more on how to actually apply Highly Advanced Epistemology to real data, and now I’m learning about it. Thanks!
2) This should be on Main.
3) Does there exist an alternative in the literature to the notation of Pr(A = a)? I hadn’t realized until now how much the use of the equal sign there makes no sense. In standard usage, the equal sign either refers to literal equivalence (or isomorphism) as in functional programming, or variable assignment, as in imperative programming. This operation is obviously not literal equivalence (the set A is not equal to the element a), and it’s only sort of like variable assignment. We do not erase our previous data of the set A: we want it to be around when we talk about observing other events from the set A.
In analogy with Pearl’s “do” notation, I propose that we have an “observe notation”, where Pr(A = a) would be written as Pr(obs_A (a)), and read as “probability that event a is observed from set A,” and not overload our precious equal sign. (The overloading with equivalence vs. variable assignment is already stressful enough for the poor piece of notation.)
I’m not proposing that you change your notation for this sequence, but I feel like this notation might serve for clearer pedagogy in general.
I agree p(A = a) is imprecise.
Good notation for interventions has to permit easy nesting and conflicts for [ good reasons I don’t want to get into right now ]. do(.) actually isn’t very good for this reason (and I have deprecated it in my own work). I like various flavors of the potential outcome notation, e.g. Y(a) to mean “response Y under intervention do(a)”. Ander uses Y^a (with a superscript) for the same thing.
With potential outcomes we can easily express things like “what would happen to Y if A were forced to a, and M were forced to whatever value M would have attained had A instead been forced to a’”: Y(a, M(a’)). You can’t even write this down with do(.).
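Concretely, a nested quantity like Y(a, M(a’)) can be estimated by simulation in a small structural model. A minimal sketch, assuming made-up linear mechanisms (none of the coefficients come from the thread); the key move is that M is computed in the world do(A = a’) while Y is computed with A forced to a:

```python
import random

def mean_nested_counterfactual(a, a_prime, n=100_000, seed=0):
    """Estimate E[Y(a, M(a'))]: Y when A is forced to a while M is
    forced to the value it would have taken under A = a'."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        eps_m = rng.gauss(0, 1)
        eps_y = rng.gauss(0, 1)
        m = 2.0 * a_prime + eps_m      # M(a'): mediator in the world do(A = a')
        y = 3.0 * a + 1.5 * m + eps_y  # Y(a, M(a')): A = a, but M taken from above
        total += y
    return total / n
```

With these invented mechanisms E[Y(1, M(0))] = 3·1 + 1.5·(2·0) = 3, the natural-direct-effect flavor of quantity the comment is describing; do(.) alone cannot name it, because A would need to be set to two different values in the same expression.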
The A=a notation always bugged me too. I like the above notation because it reveals the underlying morphism composition.
If we consider random variables as measure(able) spaces and a conditional probability P(B | A) as a stochastic map A → P(B) (i.e., a map sending each point of A to a measure on B), then every element ‘a’ of (a countably generated) A induces a point measure → A giving probability 1 to that event. This is the map named by do(a). But since we’re composing maps, not elements, we can use an element a unambiguously to mean its point measure. Then a series of measures separated by ‘,’ gives the product measure. In the above example, let a : A (implicitly, → A), a’ : B (implicitly, → B), M : B ~> C, Y : (A, C) ~> D; then Y(a, M(a’)) is the stochastic map ~> D given by the composition Y ∘ (a, M ∘ a’).
EDIT: How do I ascii art?
All of this is a fancy way of saying that “potential outcome” notation conveys exactly the right information to make probabilities behave nicely.
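If it helps, the composition can be sketched in code by treating each stochastic map as a function from an input point to a sampler, with do(a) as the “sampler” that always returns a. All names and distributions here are my own illustration, not the commenter’s:

```python
import random

rng = random.Random(7)

def point(a):
    """The point measure induced by an element a: it 'samples' a with
    probability 1. This is the map the comment says is named by do(a)."""
    return lambda: a

def M(b):
    """A stochastic map B ~> C: given a point of B, return a sampler over C."""
    return lambda: 2.0 * b + rng.gauss(0, 1)

def Y(a, c):
    """A stochastic map (A, C) ~> D."""
    return lambda: 3.0 * a + 1.5 * c + rng.gauss(0, 1)

def Y_nested(a, a_prime):
    """Y(a, M(a')): sample M at the point measure a', then pair the result
    with the point measure a and feed the product into Y."""
    c = M(point(a_prime)())()   # sample C from M composed with the point a'
    return Y(point(a)(), c)()   # compose Y with the product (a, c)
```

Because we compose maps rather than elements, passing the bare value where a measure is expected (as in Y(a, M(a’))) is unambiguous: it always means the corresponding point measure.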
Yes, one of the reasons I am not very fond of subscript or superscript notation (which, to be fair, is very commonly used) is that it quickly becomes awkward to nest things, and I personally often end up nesting things many levels deep. Parentheses are the only thing I have found that works acceptably well.
If you think of interventions as a morphism, then it is indeed very natural to think in terms of arbitrary function composition, which leads one to the usual functional notation. The reason people in the causal inference community perhaps do not find this as natural as a mathematician would is because it is difficult to interpret things like Y(a,M(a’)) as idealized experiments we could actually perform. There is a strong custom in the community (a healthy one in my opinion, because it grounds the discussion) to only consider quantities which can be so interpreted. See also this:
http://imai.princeton.edu/research/Design.html
Hunh, I never noticed that. TIL.
The “A=a” stands for the event that the random variable A takes on the value a. It’s another notation for the set {ω ∈ Ω | A(ω) = a}, where Ω is your probability space and A is a random variable (a mapping from Ω to something else, often R^n).
Okay, maybe you know that, but I just want to point out that there is nothing vague about the “A=a” notation. It’s entirely rigorous.
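On a finite probability space the set definition above can be spelled out directly; the two-coin space below is my own toy example:

```python
from fractions import Fraction
from itertools import product

# A finite probability space Omega: two fair coin flips, each point with mass 1/4.
Omega = list(product([0, 1], repeat=2))
P = {omega: Fraction(1, 4) for omega in Omega}

def A(omega):
    """A random variable A : Omega -> R; here, the number of heads."""
    return sum(omega)

def event_A_equals(a):
    """The event 'A = a', i.e. the set {omega in Omega | A(omega) == a}."""
    return {omega for omega in Omega if A(omega) == a}

def Pr(E):
    """Probability of an event: sum the measure over its points."""
    return sum(P[omega] for omega in E)
```

Here Pr(event_A_equals(1)) comes out to 1/2: the event is {(0, 1), (1, 0)}, and the “=” inside it is ordinary equality of values, not assignment.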
I think the grandparent refers to the fact that in the context of causality (not ordinary probability theory) there is a distinction between ordinary mathematical equality and imperative assignment. That is, when I write a structural equation model:
Y = f(A, M, epsilon(y))
M = g(A, epsilon(m))
A = h(epsilon(a))
and then I use p(A = a) or p(Y = y | do(A = a)) to talk about this model, one could imagine getting confused, because the symbol “=” is used in two different ways. Especially for p(Y = y | do(A = a)), which is read as: “the probability of Y being equal to y, given that I performed an imperative assignment on the variable A in the above three-line program and set it to the value a.” Both senses of “=” are used in the same expression; it is quite confusing!
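The two senses of “=” show up cleanly if the three lines are written as a program: p(A = a) is a comparison applied to simulated runs, while do(A = a) surgically replaces the assignment to A. A sketch with invented linear choices for f, g, h:

```python
import random

def run_model(do_a=None, seed=None):
    """One draw from the structural equations. If do_a is given, the
    assignment A = h(eps_a) is overwritten: that is the imperative '='."""
    rng = random.Random(seed)
    eps_a, eps_m, eps_y = (rng.gauss(0, 1) for _ in range(3))
    A = do_a if do_a is not None else eps_a  # A = h(eps_a), or do(A = a)
    M = 2.0 * A + eps_m                      # M = g(A, eps_m)
    Y = 3.0 * A + 1.5 * M + eps_y            # Y = f(A, M, eps_y)
    return A, M, Y

def p_Y_positive_given_do(a, n=20_000):
    """Estimate p(Y > 0 | do(A = a)); the '>' here (like the '=' in
    p(Y = y)) is a comparison on outcomes, not an assignment."""
    hits = sum(run_model(do_a=a, seed=i)[2] > 0 for i in range(n))
    return hits / n
```

Note that the intervention edits the program text (which line assigns A), while the probability query only inspects the values the program produces; conflating the two “=” signs blurs exactly that distinction.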