Hmm, I am not sure what to say about the fundamental theorem, because I am not really understanding the confusion. Is there something less motivating about the fundamental theorem than the analogous theorem about d-separation being equivalent to conditional independence in all distributions compatible with the DAG?
Maybe this helps? (probably not): I am mostly imagining interacting with only a single distribution in the class, and the claim about independence in all probability distributions compatible with the structure can instead be replaced with independence in a single general-position probability distribution compatible with the structure.
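To make that concrete with a toy sketch (my own hypothetical example, with made-up numbers, not anything from the post): for the collider DAG X → Z ← Y, a randomly drawn compatible distribution already exhibits exactly the independences d-separation predicts, i.e. X ⊥ Y holds but X ⊥ Y | Z fails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Collider DAG: X -> Z <- Y, all binary. A "generic" compatible distribution
# factorizes as P(x) P(y) P(z | x, y) with randomly drawn parameters.
px = rng.dirichlet(np.ones(2))                      # P(X)
py = rng.dirichlet(np.ones(2))                      # P(Y)
pz_xy = rng.dirichlet(np.ones(2), size=(2, 2))      # P(Z | X, Y), indexed [x, y, z]

joint = px[:, None, None] * py[None, :, None] * pz_xy  # P(x, y, z)

def mi(pxy):
    """Mutual information of a 2-D joint table (in nats)."""
    pxy = pxy / pxy.sum()
    pa, pb = pxy.sum(1), pxy.sum(0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(pa, pb)[nz])).sum())

# d-separation says X _||_ Y in every compatible distribution ...
print("I(X;Y)     =", mi(joint.sum(axis=2)))        # ~0

# ... but X and Y are NOT d-separated given the collider Z, and a
# general-position compatible distribution indeed shows dependence given Z.
cond_mi = sum(joint[:, :, z].sum() * mi(joint[:, :, z]) for z in range(2))
print("I(X;Y | Z) =", cond_mi)                      # > 0
```

The point is just that a single general-position compatible distribution carries the same independence information as the whole family.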
I am not thinking of it as related to a maximum entropy argument.
The point about SEMs having more structure that I am ignoring is correct. I think that the largest philosophical difference between my framework and the Pearlian one is that I am denying realism about anything beyond the “apparently identical.” Another way to think about it is that I am denying realism about there being anything to the variables beyond their information. All of my definitions are based only on the information content of the variables, so, for example, if you have two variables that are deterministic copies of each other, they will have all the same temporal properties, while in an SEM they could differ. The surprising thing is that, even without intervention data, this variable non-realism allows us to define and infer something that looks a lot like time.
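As a tiny illustration of the deterministic-copies point (my own toy check, with made-up data): a variable and its copy have identical information content, so any purely information-based definition has to treat them the same.

```python
import numpy as np

# Toy check: a variable and its deterministic copy carry exactly the same
# information, so definitions based only on information content cannot
# distinguish them. (Hypothetical illustration.)
rng = np.random.default_rng(0)
x1 = rng.integers(0, 4, size=100_000)   # some variable
x2 = x1.copy()                          # deterministic copy

def entropy(v):
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy(x1), entropy(x2))          # equal marginal entropies
print(entropy(4 * x1 + x2))              # joint entropy equals each marginal:
                                         # the pair contains no extra information
```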
I have a lot of uncertainty about learning algorithms. On the surface, it looks like my structure just has so much to check that it is going to have a hard time being practical, but I could see it going either way. Especially if you imagine it as a minor modification to graph learning, where maybe you don’t consider all re-factorizations, but you do consider e.g. taking a pair of nodes and replacing one of them with their XOR.
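Here is a toy sketch of the kind of XOR re-factorization I have in mind (hypothetical generative setup and numbers, just for illustration): the observed pair {X, Y} looks dependent, but swapping Y out for X⊕Y recovers an independent factorization.

```python
import numpy as np

# Toy version of "consider XOR re-factorizations" in structure learning
# (hypothetical illustration, not an actual algorithm): the data were
# generated as Y = X XOR W with W a biased hidden bit independent of X,
# so {X, Y} is dependent but {X, X^Y} is an independent factorization.
rng = np.random.default_rng(2)
n = 100_000
x = rng.integers(0, 2, n)                # uniform bit
w = (rng.random(n) < 0.2).astype(int)    # biased hidden bit, independent of X
y = x ^ w                                # observed

def dependence(a, b):
    """|P(a=1, b=1) - P(a=1) P(b=1)| as a crude dependence measure."""
    return abs((a & b).mean() - a.mean() * b.mean())

print("dep(X, Y)   =", dependence(x, y))      # clearly nonzero
print("dep(X, X^Y) =", dependence(x, x ^ y))  # ~0: X^Y recovers the hidden factor W
```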
I think the motivation for representing some sets of conditional independences with a DAG is pretty clear: people already use probability distributions all the time, those distributions sometimes have conditional independences, and visuals are nice.
On the other hand, the fundamental theorem relates orthogonality to independences in a family of distributions generated in a particular way. Neither of these things is a natural property of probability distributions in the way that conditional independence is. If I am using probability distributions, it seems to me I’d rather avoid introducing them if I can. Even if the reasons are mysterious, it might be useful to work with models of this type; I was just wondering if there are reasons for doing so that are apparent before you derive any useful results.
Alternatively, is it plausible that you could derive the same results using just probability + whatever else you need anyway? For example, you could perhaps define X to be prior to Y if, relative to some ordering of functions by “naturalness”, there is a more natural f(X,Y) such that X ⊥ f(X,Y) and not X ⊥ f(X,Y) | Y than any g(X,Y) such that Y ⊥ g(X,Y), etc. I have no idea if that actually works!
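As a sanity check in one toy case (entirely my own hypothetical construction, and I’m not claiming it generalizes): with X ~ Bern(0.3), N ~ Bern(0.2) independent, and Y = X XOR N, the only non-constant functions h(X,Y) with X ⊥ h(X,Y) and not X ⊥ h(X,Y) | Y are XOR and XNOR, and no non-constant h witnesses the same conditions with X and Y swapped, which at least points in the direction of “X is prior to Y”.

```python
import numpy as np
from itertools import product

# Toy, exact-probability check (hypothetical construction): X ~ Bern(0.3),
# N ~ Bern(0.2) independent, Y = X XOR N. For each of the 16 boolean
# functions h(X, Y) we test "X _||_ h and not (X _||_ h | Y)" as a witness
# that X is prior, and the symmetric condition for Y.
px, pn = np.array([0.7, 0.3]), np.array([0.8, 0.2])
joint = np.array([[px[x] * pn[x ^ y] for y in (0, 1)] for x in (0, 1)])  # P(x, y)

def indep(a, b, given=None):
    """Exact (conditional) independence of a(x,y) and b(x,y) under `joint`."""
    conds = (0, 1) if given is not None else (None,)
    for c in conds:
        t = np.zeros((2, 2))
        for x, y in product((0, 1), repeat=2):
            if given is None or given(x, y) == c:
                t[a(x, y), b(x, y)] += joint[x, y]
        if t.sum() > 0:
            t /= t.sum()
            if not np.allclose(t, np.outer(t.sum(1), t.sum(0))):
                return False
    return True

X, Y = (lambda x, y: x), (lambda x, y: y)
for bits in product((0, 1), repeat=4):                  # truth table of h
    h = lambda x, y, b=bits: b[2 * x + y]
    if indep(X, h) and not indep(X, h, given=Y):
        print("witness that X is prior:", bits)
    if indep(Y, h) and not indep(Y, h, given=X):
        print("witness that Y is prior:", bits)
```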
However, I’m pretty sure you’ll need something like a naturalness ordering in order to separate “true orthogonality” from “merely apparent orthogonality”, which is why I think it’s fair to posit it as an element of “whatever else you need anyway”. Maybe not.