Causality is useful mainly insofar as different instances can be compactly described as different simple interventions on the same Bayes net.
Thinking about this algorithmically: in e.g. factor analysis, after performing PCA to reduce a high-dimensional dataset to a low-dimensional one, it’s common to use varimax to “rotate” the principal components so that each resulting axis has a sparse relationship with the original indicator variables (each rotated component correlating with only one indicator). However, this instead seems to suggest rotating them so that the resulting axes have a sparse relationship with the original cases (each data point deviating from the mean on as few components as possible).
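To make the contrast concrete, here is a rough numpy sketch of the standard varimax iteration (which sparsifies the loadings over variables); the case-sparse alternative I have in mind would, I think, amount to applying the same criterion to the component scores (the transposed problem) rather than the loadings. The function name and parameters are just illustrative, not from any particular library.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loadings matrix so each rotated component
    loads strongly on only a few of the original variables."""
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient-like update for the varimax criterion, solved via SVD.
        G = loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt
        new_var = np.sum(s)
        if new_var - var < tol:
            break
        var = new_var
    return loadings @ R

rng = np.random.default_rng(0)
raw = rng.standard_normal((8, 3))   # stand-in loadings matrix
rotated = varimax(raw)
# The rotation is orthogonal, so total variance (Frobenius norm) is preserved:
print(np.isclose(np.linalg.norm(rotated), np.linalg.norm(raw)))   # True
```

Since the rotation is orthogonal, it redistributes variance across axes without changing the fitted subspace, which is why the choice between variable-sparse and case-sparse rotations is purely a matter of interpretability.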
I believe that this sort of rotation (without the PCA step) has actually been used in certain causal inference algorithms. As far as I can tell, though, it basically assumes that causality flows from variables with higher kurtosis to variables with lower kurtosis. That seems plausible in a lot of cases, but it also seems like it would consistently give the wrong results under certain nonlinear/thresholding effects (which seem plausible in some of the areas where I’ve been looking to apply it).
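To make the kurtosis intuition concrete, here is a hedged sketch in the spirit of LiNGAM-style methods (the closest thing I know of to the algorithms gestured at here): in a linear model with non-Gaussian noise, the effect is a sum and so sits closer to Gaussian (lower excess kurtosis) than the cause, and regression residuals look independent of the regressor only in the true causal direction. All variables and coefficients below are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Hypothetical linear model with heavy-tailed (Laplace) noise: x causes y.
x = rng.laplace(size=n)
y = 0.8 * x + rng.laplace(size=n)

def excess_kurtosis(v):
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3

def dep_score(cause, effect):
    """Regress effect on cause; score dependence between regressor and
    residual via correlation of their squares (near zero under independence)."""
    beta = np.cov(cause, effect)[0, 1] / np.var(cause)
    resid = effect - beta * cause
    return abs(np.corrcoef(cause**2, resid**2)[0, 1])

# The downstream variable is a sum, so it is closer to Gaussian:
print(excess_kurtosis(x) > excess_kurtosis(y))   # True
# Residuals look independent only when regressing in the causal direction:
print(dep_score(x, y) < dep_score(y, x))         # True
```

The failure mode I worry about shows up exactly here: if the effect is instead thresholded (say y = (x > 1)), the linearity assumption breaks and this heuristic can point the wrong way.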
Not sure whether you’d say I’m thinking about this right?
For instance, in the sprinkler system, some days the sprinkler is in fact turned off, or there’s a tarp up, or what have you, and then the system is well-modeled by a simple intervention.
I’m trying to think of why modelling this using a simple intervention is superior to modelling it as e.g. a conditional. One answer I could come up with is that there may be correlations across the different instances of the system, e.g. seasonal variation in rain, or the sprinkler being turned on partway through a day. Though these sorts of correlations are probably best modelled by expanding the Bayesian network to include time or similar.
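One concrete way to see the difference between conditioning and a simple intervention, using a stripped-down piece of the sprinkler net (just a Cloudy → Sprinkler edge, with made-up probabilities): observing the sprinkler off is evidence about the weather, while do(Sprinkler = off) severs the incoming edge and leaves the prior untouched.

```python
# Stripped-down sprinkler net: only the Cloudy -> Sprinkler edge.
# All probabilities are made up for illustration.
P_cloudy_true = 0.5
P_sprinkler_on = {True: 0.1, False: 0.5}   # P(Sprinkler=on | Cloudy)

def p_cloudy_given(sprinkler_on, do=False):
    """P(Cloudy=True) after observing vs. intervening on the sprinkler."""
    num = den = 0.0
    for cloudy in (True, False):
        p_c = P_cloudy_true if cloudy else 1 - P_cloudy_true
        if do:
            # Intervention: the sprinkler CPT is replaced by a point mass,
            # so the sprinkler state carries no evidence about Cloudy.
            p_s = 1.0
        else:
            p_on = P_sprinkler_on[cloudy]
            p_s = p_on if sprinkler_on else 1 - p_on
        den += p_c * p_s
        if cloudy:
            num += p_c * p_s
    return num / den

print(p_cloudy_given(False))            # observing S=off: 0.45/0.7 ≈ 0.643
print(p_cloudy_given(False, do=True))   # do(S=off): prior stays at 0.5
```

So conditioning and intervening give genuinely different distributions over the rest of the net, which is one answer to why the intervention model isn’t redundant with the conditional one.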
> I believe that this sort of rotation (without the PCA) has actually been used in certain causal inference algorithms, but as far as I can tell it basically assumes that causality flows from variables with higher kurtosis to variables with lower kurtosis, which admittedly seems plausible for a lot of cases, but also seems like it consistently gives the wrong results if you’ve got certain nonlinear/thresholding effects (which seem plausible in some of the areas I’ve been looking to apply it).
Where did you get this notion about kurtosis? Factor analysis and PCA take only a correlation matrix as input, and so model only the second-order moments of the joint distribution (i.e. correlations/variances/covariances, but not kurtosis). In fact, it is sometimes assumed in factor analysis that all variables and latent factors are jointly multivariate normal (and so all random variables have excess kurtosis 0).
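A quick numerical check of this claim: build two samples with the same covariance but very different fourth moments (Gaussian vs. Laplace), and note that PCA, whose only input is the covariance matrix, recovers essentially the same principal directions for both. The mixing matrix here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
mix = np.array([[2.0, 1.0], [0.0, 1.0]])   # arbitrary; target cov = mix @ mix.T

# Same covariance, very different kurtosis (excess 0 vs. +3 per coordinate).
gauss = rng.standard_normal((n, 2)) @ mix.T
lap = rng.laplace(scale=1 / np.sqrt(2), size=(n, 2)) @ mix.T  # unit-variance Laplace

def leading_direction(data):
    cov = np.cov(data, rowvar=False)   # PCA's only input
    w, v = np.linalg.eigh(cov)
    return v[:, np.argmax(w)]

# The leading principal directions agree up to sign and sampling noise,
# even though the fourth moments of the two datasets differ sharply:
agreement = abs(leading_direction(gauss) @ leading_direction(lap))
print(agreement > 0.99)   # True
```

Anything sensitive only to the covariance matrix is by construction blind to kurtosis, which is the point being made here.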
A Bayes net is not the same thing as PCA/factor analysis, in part because it tries to factorize the entire joint distribution rather than just the correlation matrix.
This part of the comment wasn’t about PCA/FA, hence “without the PCA”. The formal name for what I had in mind is ICA (independent component analysis), which often works by maximizing kurtosis.
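For concreteness, here is a minimal projection-pursuit sketch of the kurtosis-maximizing idea (one flavor of ICA; FastICA and friends use smoother contrast functions): mix two independent heavy-tailed sources, whiten, and scan rotation angles for the projection with maximal |excess kurtosis|, which lines up with one of the original sources. The sources and mixing matrix are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
s = rng.laplace(size=(2, n))              # independent heavy-tailed sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])    # arbitrary mixing matrix
x = A @ s

# Whiten, so that any remaining indeterminacy is a pure rotation.
x = x - x.mean(axis=1, keepdims=True)
w, v = np.linalg.eigh(np.cov(x))
z = np.diag(w ** -0.5) @ v.T @ x          # identity covariance

def excess_kurtosis(y):
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3

# Projection pursuit: the angle maximizing |excess kurtosis| of the projection
# picks out a source direction -- kurtosis-based ICA in miniature.
angles = np.linspace(0.0, np.pi, 361)
best = max(angles,
           key=lambda t: abs(excess_kurtosis(np.cos(t) * z[0] + np.sin(t) * z[1])))
u = np.cos(best) * z[0] + np.sin(best) * z[1]
match = max(abs(np.corrcoef(u, s[0])[0, 1]), abs(np.corrcoef(u, s[1])[0, 1]))
print(match > 0.95)   # True: the max-kurtosis axis is (±) one of the sources
```

This is also why the method needs non-Gaussian sources at all: for Gaussian data every rotation of the whitened signal has the same (zero) excess kurtosis, and the criterion has nothing to grab onto.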
What you seemed to be saying is that a certain rotation (“one should rotate them so that the resulting axes have a sparse relationship with the original cases”) has “actually been used” and “it basically assumes that causality flows from variables with higher kurtosis to variables with lower kurtosis”.
I don’t see what the kurtosis-maximizing algorithm has to do with the choice of rotation used in factor analysis or PCA.