I don’t understand the motivation behind the fundamental theorem, and I’m wondering if you could say a bit more about it. In particular, it suggests that if I want to propose a family of probability distributions that “represent” observations somehow (“represents” maybe means in the sense of Bayesian predictions or in the sense of frequentist limits), I want to also consider this family to arise from some underlying mutually independent family along with some function. I’m not sure why I would want to propose an underlying family and a function at all, and even if I do I’m not sure why I want to suppose it is mutually independent.
One thought I had is that maybe this underlying family of distributions on S is supposed to represent “interventions”. The reasoning would be something like: there is some set of things that fully control my observations that I can control independently and which also vary independently on their own. I don’t find this convincing, though—I don’t see why independent controllability should imply independent variation.
Another thought I had was that maybe it arises from some kind of maximum entropy argument, but it’s not clear why I would want to maximise the entropy of a distribution on some S for every possible configuration of marginals.
Also, do you know how your model relates to structural equation models with hidden variables? Your factored set S looks a lot like a set of independent “noises”, the function f:S->Y looks a lot like a set of structural equations, and I think it’s straightforward to introduce hidden variables as needed to account for any lossiness. In particular, given a model and a compatible orthogonality database, I can construct an SEM by taking all the variables that appear in the database and defining the structural equation for X to be X:=X∘f. However, I think the set of all SEMs that are compatible with a given orthogonality database is larger than the set of all FFS models that are similarly compatible. This is because SEMs (in the Pearlian sense) can be distinct even if they have “apparently identical” structural equations. For example, X:=1, Y:=X and X:=1, Y:=1 are distinct because interventions on X will have different results, while my understanding is that they will correspond to the same FFS model.
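To make that last point concrete, here is a minimal sketch (my own illustration, not from the discussion) of two SEMs that are observationally identical but come apart under intervention:

```python
# Two SEMs over binary variables X and Y. Observationally both always
# produce (X, Y) = (1, 1), but they respond differently to do(X := 0).

def sem_a(do_x=None):
    # X := 1, Y := X
    x = 1 if do_x is None else do_x
    y = x
    return x, y

def sem_b(do_x=None):
    # X := 1, Y := 1
    x = 1 if do_x is None else do_x
    y = 1
    return x, y

assert sem_a() == sem_b() == (1, 1)   # observationally identical
assert sem_a(do_x=0) == (0, 0)        # intervention reveals the arrow X -> Y...
assert sem_b(do_x=0) == (0, 1)        # ...or its absence
```

Without intervention data, nothing distinguishes the two, which is exactly why they collapse to the same FFS model.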
Your result 2e looks interestingly similar to the DAG result that says X⊥Z and X⊥/Z|Y implies something like X→Y←Z (where ⊥ is d-separation). In fact, I think it extends existing graph learning algorithms: in addition to checking independences among the variables as given, you can look for independences between any functions of the given variables. This seems like it would give you many more arrows than usual, though I imagine it would also increase your risk of spurious independences. In fact, I think this connects to approaches to causal identification like regression with a subsequent independence test: if X is independent of Y−E[Y|X], we prefer X→Y, and if Y is independent of X−E[X|Y], we prefer Y→X.
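As an illustration of the regression-plus-independence-test idea, here is a rough sketch of my own (using a crude biased HSIC statistic as the dependence measure rather than a proper calibrated test, and assuming a linear model with non-Gaussian noise so the direction is identifiable):

```python
import numpy as np

rng = np.random.default_rng(0)

def hsic(a, b, sigma=1.0):
    """Biased HSIC estimate with Gaussian kernels: a crude dependence
    measure that is near zero when a and b are roughly independent."""
    n = len(a)
    K = np.exp(-np.subtract.outer(a, a) ** 2 / (2 * sigma ** 2))
    L = np.exp(-np.subtract.outer(b, b) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

# Ground truth: X -> Y, linear with uniform (non-Gaussian) noise.
n = 500
x = rng.uniform(-1, 1, n)
y = 2 * x + rng.uniform(-1, 1, n)

# Regress each variable on the other and keep the residuals.
res_fwd = y - np.polyval(np.polyfit(x, y, 1), x)  # Y - E[Y|X]
res_bwd = x - np.polyval(np.polyfit(y, x, 1), y)  # X - E[X|Y]

# In the causal direction the residual is (approximately) the independent
# noise term, so its dependence on the regressor should be much smaller.
print(hsic(x, res_fwd), hsic(y, res_bwd))
```

The forward HSIC comes out much smaller than the backward one on this data, matching the "prefer X→Y" rule above.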
Hmm, I am not sure what to say about the fundamental theorem, because I am not really understanding the confusion. Is there something less motivating about the fundamental theorem than the analogous theorem about d-separation being equivalent to conditional independence in all distributions compatible with the DAG?
Maybe this helps? (probably not): I am mostly imagining interacting with only a single distribution in the class, and the claim about independence in all probability distributions compatible with the structure can instead be replaced with independence in a general-position probability distribution compatible with the structure.
I am not thinking of it as related to a maximum entropy argument.
The point about SEMs having more structure that I am ignoring is correct. I think that the largest philosophical difference between my framework and the Pearlian one is that I am denying realism about anything beyond the “apparently identical.” Another way to think about it is that I am denying realism about there being anything about the variables beyond their information. All of my definitions are based only on the information content of the variables, and so, for example, if you have two variables that are deterministic copies of each other, they will have all the same temporal properties, while in an SEM, they could be different. The surprising thing is that even without intervention data, this variable non-realism allows us to define and infer something that looks a lot like time.
I have a lot of uncertainty about learning algorithms. On the surface, it looks like my structure just has so much to check and is going to have a hard time being practical, but I could see it going either way. Especially if you imagine it as a minor modification to graph learning, where maybe you don’t consider all re-factorizations, but you do consider e.g. taking a pair of nodes and replacing one of them with the XOR.
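A toy illustration of why XOR substitutions expose extra independences to check (my own sketch, not from the discussion): with two fair coins, the XOR is independent of each coin individually, even though the three variables are jointly deterministic.

```python
from itertools import product

# Enumerate the uniform distribution over two fair coins X and Y,
# and form Z = X XOR Y. Each outcome has probability 1/4.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

def marginal(i):
    # P(variable i = v) for v in {0, 1}
    return {v: sum(1 for o in outcomes if o[i] == v) / 4 for v in (0, 1)}

def independent(i, j):
    # Check P(i=a, j=b) == P(i=a) * P(j=b) for all a, b.
    mi, mj = marginal(i), marginal(j)
    return all(
        sum(1 for o in outcomes if o[i] == a and o[j] == b) / 4 == mi[a] * mj[b]
        for a in (0, 1) for b in (0, 1)
    )

X, Y, Z = 0, 1, 2
print(independent(X, Z), independent(Y, Z))  # True True
# ...yet Z is a deterministic function of (X, Y): a constraint a graph
# learner over {X, Y} alone would never test.
```

So a learner that also considers XOR-substituted node sets sees pairwise independences (and a joint determinism) that are invisible if it only tests the variables as given.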
I think the motivation for the representability of some sets of conditional independences with a DAG is pretty clear: people already use probability distributions all the time, those distributions sometimes have conditional independences, and visuals are nice.
On the other hand, the fundamental theorem relates orthogonality to independences in a family of distributions generated in a particular way. Neither of these things is a natural property of probability distributions in the way that conditional independence is. If I am using probability distributions, it seems to me I’d rather avoid introducing them if I can. Even if the reasons are mysterious, it might be useful to work with models of this type—I was just wondering if there were reasons for doing so that are apparent before you derive any useful results.
Alternatively, is it plausible that you could derive the same results just using probability + whatever else you need anyway? For example, you could perhaps define X to be prior to Y if, relative to some ordering of functions by “naturalness”, there is a more natural f(X,Y) such that X⊥f(X,Y) and X⊥/f(X,Y)|Y than any g(X,Y) such that Y⊥g(X,Y) etc. I have no idea if that actually works!
However, I’m pretty sure you’ll need something like a naturalness ordering in order to separate “true orthogonality” from “merely apparent orthogonality”, which is why I think it’s fair to posit it as an element of “whatever else you need anyway”. Maybe not.