Dynamical dependencies—one variable depending on the derivative or integral of another. (Dealing with these by discretising time and replacing every variable X by an infinite series X0,X1,X2… does not, I believe, yield any useful analysis.) The result is that correlations associated with direct causal links can be exactly zero, yet not in a way that can be described as cancellation of multiple dependencies. The problem is exacerbated when there are also cyclic dependencies.
There has been some work on causal analysis of dynamical systems with feedback, but there are serious obstacles to existing approaches, which I discuss in a paper I’m currently trying to get published.
Sorry, confused. A function is not always uncorrelated with its derivative. Correlation is a measure of co-linearity, not co-dependence. Do you have any examples where statistical dependence does not imply causality without a faithfulness violation? Would you mind maybe sending me a preprint?
edit to express what I meant better: “Do you have any examples where lack of statistical dependence coexists with causality, and this happens without path cancellations?”
A function is not always uncorrelated with its derivative.
I omitted some details, crucially that the function be bounded. If it is, then the long-term correlation with its derivative tends to zero, providing only that it’s well-behaved enough for the correlation to be defined. Alternatively, for a finite interval, the correlation is zero if it has the same value at the beginning and the end. This is pretty much immediate from the fact that the integral of x(dx/dt) is (x^2)/2. A similar result holds for time series, the proof proceeding from the discrete analogue of that formula, (x+y)(x-y) = x^2-y^2.
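Spelled out, as a sketch of the argument above: the bound M on |x| and the time average written as \bar{x}_T are notation introduced here, not in the original comment.

$$\int_0^T x\,\frac{dx}{dt}\,dt = \frac{x(T)^2 - x(0)^2}{2}, \qquad \frac{1}{T}\int_0^T \frac{dx}{dt}\,dt = \frac{x(T) - x(0)}{T}$$

$$\mathrm{Cov}_T\!\left(x, \frac{dx}{dt}\right) = \frac{x(T)^2 - x(0)^2}{2T} - \bar{x}_T\,\frac{x(T) - x(0)}{T}$$

The covariance is exactly zero when x(T) = x(0). If |x| \le M throughout, its magnitude is at most 5M^2/(2T), which tends to zero as T grows, so the correlation tends to zero as long as the variances of x and dx/dt stay away from zero. For a time series the same telescoping works term by term:

$$\sum_{t=0}^{N-1} (x_{t+1} + x_t)(x_{t+1} - x_t) = x_N^2 - x_0^2$$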
To put that more concretely, if in the long term you’re getting neither richer nor poorer, then there will be no correlation between monthly average bank balance and net monthly income.
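A minimal sketch of the bank-balance case, assuming Gaussian monthly net income shifted so that it sums to zero over the period (so the final balance equals the initial one). The function names, the 1200-month horizon and the seed are illustrative choices, not anything from the discussion above.

```python
import math
import random


def pearson(xs, ys):
    """Plain product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def balance_income_correlation(n_months=1200, seed=0):
    rng = random.Random(seed)
    income = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
    # Assumption: neither richer nor poorer overall, so net income sums to zero.
    mean_income = sum(income) / n_months
    income = [v - mean_income for v in income]

    # The balance is the running integral (cumulative sum) of income.
    balance = [0.0]
    for v in income:
        balance.append(balance[-1] + v)

    # Monthly average balance: midpoint of the balance at the start and end of the month.
    avg_balance = [(balance[t] + balance[t + 1]) / 2 for t in range(n_months)]
    return pearson(avg_balance, income)


r = balance_income_correlation()
print(f"correlation between average balance and net income: {r:.3e}")
```

Income is the only cause of the balance here, yet the correlation comes out as zero up to rounding error, because the sum of (average balance × income) telescopes to half the difference of the squared end balances.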
Do you have any examples where statistical dependence does not imply causality without a faithfulness violation?
Don’t you mean causality not implying statistical dependence, which is what these examples have been showing? That pretty much is the faithfulness assumption, so of course faithfulness is violated by the systems I’ve mentioned, where causal links are associated with zero correlation. In some cases, if the system is sampled on a timescale longer than its settling time, causal links are associated not only with zero product-moment correlation, but zero mutual information of any sort.
Statistical dependence does imply that somewhere there is causality (considering identity a degenerate case of causality—when X, Y, and Z are independent, X+Y correlates with X+Z). The causality, however, need not be in the same place as the dependence.
Would you mind maybe sending me a preprint?
Certainly. Is this web page current for your email address?
Don’t you mean causality not implying statistical dependence, which is what these examples have been showing?
That’s right, sorry.
I had gotten the impression that you thought causal systems where things are related to derivatives/integrals introduce a case where this happens and it’s not due to “cancellations” but something else. From my point of view, correlation is not a very interesting measure—it’s a holdover from simple parametric statistical models that gets applied far beyond its actual capability.
People misuse simple regression models in the same way. For example, if you use linear causal regressions, direct effects are just regression coefficients. But as soon as you start using interaction terms, this stops being true (but people still try to use coefficients in these cases...)
edit to express what I meant better: “Do you have any examples where lack of statistical dependence coexists with causality, and this happens without path cancellations?”
The capacitor example is one: there is one causal arrow, so no multiple paths that could cancel, and no loops. The arrow could run in either direction, depending on whether the power supply is set up to generate a voltage or a current.
Of course, I is by definition proportional to dV/dt, and this is discoverable by looking at the short-term transient behaviour. But sampled on a long timescale you just get a sequence of i.i.d. pairs.
For cyclic graphs, I’m not sure how “path cancellation” is defined, if it is at all. The generic causal graph of the archetypal control system has arrows D --> P --> O and R --> O --> P, there being a cycle between P and O. The four variables are the Disturbance, the Perception, the Output, and the Reference.
If P = O+D, O is proportional to the integral of R-P, R = zero, and D is a signal varying generally on a time scale slower than the settling time of the loop, then O has a correlation with D close to −1, and O and D have correlations with P close to zero.
There are only two parameters, the settling time of the loop and the timescale of variations in D. So long as the former is substantially less than the latter, these correlations are unchanged.
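A rough simulation of this loop, under assumptions of my own choosing for illustration: unit loop gain k, a disturbance made by passing white noise through two cascaded low-pass filters with time constant tau_d = 100 (so D is smooth and slow), simple Euler integration with step dt, and an initial burn-in that is discarded.

```python
import math
import random


def pearson(xs, ys):
    """Plain product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_control_loop(k=1.0, tau_d=100.0, dt=0.1,
                          n_steps=200_000, burn_in=20_000, seed=0):
    """Return (corr(O, D), corr(P, O), corr(P, D)) for the integrating controller."""
    rng = random.Random(seed)
    u = d = o = 0.0
    ps, os_, ds = [], [], []
    for step in range(n_steps):
        # Slow disturbance D: white noise through two first-order low-pass filters.
        u += (dt / tau_d) * (rng.gauss(0.0, 1.0) - u)
        d += (dt / tau_d) * (u - d)
        p = o + d                      # perception P = O + D
        if step >= burn_in:
            ps.append(p)
            os_.append(o)
            ds.append(d)
        o += dt * k * (0.0 - p)        # O integrates R - P, with R = 0
    return pearson(os_, ds), pearson(ps, os_), pearson(ps, ds)


corr_od, corr_po, corr_pd = simulate_control_loop()
print(f"corr(O, D) = {corr_od:.4f}")
print(f"corr(P, O) = {corr_po:.4f}")
print(f"corr(P, D) = {corr_pd:.4f}")
```

With the loop settling about a hundred times faster than D varies, O should track −D almost perfectly while P stays nearly uncorrelated with both, even though P is the only route by which D influences O.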
Would you consider this an example of path cancellation? If so, what are the paths, and what excludes this system from the scope of theorems about faithfulness violations having measure zero? Not being a DAG is one reason, of course, but have any such theorems been extended to at least some class of cyclic graphs?
Addendum:
When D is a source with a long-term Gaussian distribution, the statistics of the system are multivariate Gaussian, so correlation coefficients capture the entire statistical dependence. Following your suggestion about non-parametric dependence tests I’ve run simulations in which D instead makes random transitions between +/- 1, and calculated statistics such as Kendall’s tau, but the general pattern is much the same. The controller takes time to respond to the sudden transitions, which allows the zero correlations to turn into weak ones, but that only happens because the controller is failing to control at those moments. The better the controller works, the smaller the correlation of P with O or D.
I’ve also realised that “non-parametric statistics” is a subject like the biology of non-elephants, or the physics of non-linear systems. Shannon mutual information sounds in theory like the best possible measure, but for continuous quantities I can get anything from zero to perfect prediction of one variable from the other just by choosing a suitable bin size for the data. No statistical conclusions without statistical assumptions.
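A small sketch of the bin-size problem, assuming the naive plug-in (histogram) estimate of mutual information and two variables that are independent by construction. The function name, sample size and bin counts are illustrative choices.

```python
import math
import random
from collections import Counter


def binned_mutual_information(xs, ys, n_bins):
    """Plug-in mutual information (in bits) of data in [0, 1) cut into n_bins equal bins."""
    n = len(xs)
    bx = [min(int(x * n_bins), n_bins - 1) for x in xs]
    by = [min(int(y * n_bins), n_bins - 1) for y in ys]
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # c/n is the joint frequency; px[i]/n and py[j]/n are the marginal frequencies.
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi


rng = random.Random(0)
n = 200
xs = [rng.random() for _ in range(n)]
ys = [rng.random() for _ in range(n)]   # independent of xs by construction

mi_one_bin = binned_mutual_information(xs, ys, 1)
mi_moderate = binned_mutual_information(xs, ys, 10)
mi_fine = binned_mutual_information(xs, ys, 10**9)

print(f"1 bin:      {mi_one_bin:.3f} bits")
print(f"10 bins:    {mi_moderate:.3f} bits")
print(f"10^9 bins:  {mi_fine:.3f} bits (log2 n = {math.log2(n):.3f})")
```

With one bin the estimate is zero; with bins so fine that nearly every point sits alone in its cell, the estimate approaches log2(n), which is what perfect prediction of one variable from the other would give. Nothing about the data changed, only the binning.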
I have not forgotten about your paper, I am just extremely busy until early March. Three quick comments though:
(a) People have viewed cyclic models as defining a stable distribution in an appropriate Markov chain. There are some complications, and it seems that with cyclic models (unlike the DAG case) the graph that predicts what happens after an intervention and the graph that represents the independence structure of the equilibrium distribution are not the same graph (this is another reason to treat the statistical and causal graphical models separately). See Richardson and Lauritzen’s chain graph paper for a simple four-node example of this.
So when we say there is a faithfulness violation, we have to make sure we are talking about the right graph representing the right distribution.
(b) In general I view a derivative not as a node, but as an effect. So e.g. in a linear model:
y = f(x) = ax + e
dy/dx = a = E[y|do(x=1)] - E[y|do(x=0)], which is just the causal effect of x on y on the mean difference scale.
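As a quick numerical check of that identity, here is a sketch that assumes standard Gaussian noise e and an arbitrary slope a = 2.5. Both are illustrative choices, and `interventional_mean` is a hypothetical helper name.

```python
import random


def interventional_mean(x_value, a, n_samples, rng):
    """Estimate E[y | do(x = x_value)] for y = a*x + e, with e ~ N(0, 1)."""
    total = 0.0
    for _ in range(n_samples):
        e = rng.gauss(0.0, 1.0)       # noise is drawn independently of the forced x
        total += a * x_value + e
    return total / n_samples


rng = random.Random(1)
a = 2.5                               # assumed slope, for illustration only
effect = (interventional_mean(1.0, a, 100_000, rng)
          - interventional_mean(0.0, a, 100_000, rng))
print(f"estimated E[y|do(x=1)] - E[y|do(x=0)] = {effect:.3f} (slope a = {a})")
```

The estimated difference should land close to a, up to sampling noise.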
In general, the partial derivative of the outcome with respect to some treatment, holding the other treatments constant, is a kind of direct causal effect. So viewed through that lens it is perhaps not so surprising that x and dy/dx are independent. After all, the direct effect/derivative is a function of p(y|do(x),do(other parents of y)), and we know do(.) cuts incoming arcs to y, so the distribution p(y|do(x),do(other parents of y)) is independent of p(x) by construction.
But this is more an explanation of why derivatives sensibly represent interventional effects, not whether there is something more to this observation (I think there might be). I do feel that Newton’s intuition for doing derivatives was trying to formalize a limit of “wiggle the independent variable and see what happens to the dependent variable”, which is precisely the causal effect. He was worried about physical systems, also, where causality is fairly clear.
In general, p(y) and any function of p(y | do(x)) are not independent of course.
(c) I think you define a causal model in terms of the Markov factorization, which I disagree with. The Markov factorization
$$p(x_1, \ldots, x_n) = \prod_i p(x_i \mid \mathrm{pa}(x_i))$$
defines a statistical model. To define a causal model you essentially need to formally state that parents of every node are that node’s direct causes. Usually people use the truncated factorization (g-formula) to do this. See, e.g. chapter 1 in Pearl’s book.
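For reference, the truncated factorization for an intervention on a single variable x_j, written in the same notation. This is the standard textbook form as I recall it; x_j^* denotes the value set by the intervention and is notation introduced here.

$$p(x_1, \ldots, x_n \mid \mathrm{do}(x_j = x_j^*)) = \prod_{i \neq j} p(x_i \mid \mathrm{pa}(x_i)) \quad \text{if } x_j = x_j^*, \text{ and } 0 \text{ otherwise.}$$

The only difference from the Markov factorization is that the factor for the intervened variable is dropped, which is what encodes the claim that the parents are direct causes.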
I think that also works with acyclic graphs: suppose you have an arrow from “eXercising” to “Eating a lot”, one from “Eating a lot” to “gaining Weight”, and one from “eXercising” to “gaining Weight”, and P(X) = 0.5, P(E|X) = 0.99, P(E|~X) = 0.01, P(W|X E) = 0.5, P(W|X ~E) = 0.01, P(W|~X E) = 0.99, P(W|~X ~E) = 0.5. Then W would be nearly uncorrelated with X (P(W|X) = 0.4951, P(W|~X) = 0.5049) and nearly uncorrelated with E (P(W|E) = 0.5049, P(W|~E) = 0.4951, if I did the maths right), but it doesn’t mean it isn’t caused by either.
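The conditional probabilities can be checked by brute-force enumeration of the joint distribution. A minimal sketch using only the numbers stated in the example; the function and dictionary key names are my own.

```python
from itertools import product

P_X = 0.5
P_E_GIVEN_X = {True: 0.99, False: 0.01}
P_W_GIVEN_XE = {
    (True, True): 0.5,
    (True, False): 0.01,
    (False, True): 0.99,
    (False, False): 0.5,
}


def joint(x, e, w):
    """P(X=x, E=e, W=w) from the factorization P(X) P(E|X) P(W|X,E)."""
    px = P_X if x else 1 - P_X
    pe = P_E_GIVEN_X[x] if e else 1 - P_E_GIVEN_X[x]
    pw = P_W_GIVEN_XE[(x, e)] if w else 1 - P_W_GIVEN_XE[(x, e)]
    return px * pe * pw


def prob_w_given(var_index, value):
    """P(W = True | variable at var_index (0 for X, 1 for E) equals value)."""
    numerator = 0.0
    denominator = 0.0
    for x, e, w in product([True, False], repeat=3):
        if (x, e)[var_index] != value:
            continue
        p = joint(x, e, w)
        denominator += p
        if w:
            numerator += p
    return numerator / denominator


results = {
    "P(W|X)": prob_w_given(0, True),
    "P(W|~X)": prob_w_given(0, False),
    "P(W|E)": prob_w_given(1, True),
    "P(W|~E)": prob_w_given(1, False),
}
for name, value in results.items():
    print(f"{name} = {value:.4f}")
```

By this calculation the four values come out at roughly 0.4951, 0.5049, 0.5049 and 0.4951, so W is very nearly uncorrelated with both X and E even though both arrows into W are strong.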
Yes, this is the mechanism of cancellation of multiple causal paths. In theory one can prove, with assumptions akin to the ideal point masses and inextensible strings of physics exercises, that the probability of exact cancellation is zero; in practice, finite sample sizes mean that cancellation cannot necessarily be excluded.
And then to complicate that example, consider a professional boxer who is trying to maintain his weight just below the top of a given competition band. You then have additional causal arrows back from Weight to both eXercise and Eating. As long as he succeeds in controlling his weight, it won’t correlate with exercise or eating.
Yes, the Harvard address still works.
I just noticed your edit:
Dear Richard,