if there’s no correlation, there almost certainly isn’t causation.
This is completely wrong, though not many people seem to understand that yet.
For example, the voltage across a capacitor is uncorrelated with the current through it; and another poster has pointed out the example of the thermostat, a topic I’ve also written about on occasion.
It’s a fundamental principle of causal inference that you cannot get causal conclusions from wholly acausal premises and data. (See Judea Pearl, passim.) This applies just as much to negative conclusions as positive. Absence of correlation cannot on its own be taken as evidence of absence of causation.
It depends. While true when the signal is periodic, it is not so in general. A spike of current through the capacitor results in a voltage change. Trivially, if the voltage is an exponential (V = V0 exp(-at)), then so is the current (I = C dV/dt = -aC V0 exp(-at)), with 100% correlation between the two on a given interval.
As for Milton’s thermostat, only the perfect one is uncorrelated (the better the control system, the less the correlation), and no control system without complete future knowledge of inputs is perfect. Of course, if the control system is good enough, in practice the correlation will drown in the noise. That’s why there is so little good evidence that fiscal (or monetary) policy works.
I skipped some details. A crucial condition is that the voltage be bounded in the long term, which excludes the exponential example. Or for finite intervals, if the voltage is the same at the beginning and the end, then over that interval there will be zero correlation with its first derivative. This is true regardless of periodicity. It can be completely random (but differentiable, and well-behaved enough for the correlation coefficient to exist), and the zero correlation will still hold.
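A quick numerical sketch of that claim (my own; the bounded signal is just an arbitrary smoothed random recursion, the constants are arbitrary, and numpy's gradient stands in for dV/dt):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dt = 200_000, 0.01
t = np.arange(n) * dt

# A bounded, random, non-periodic "voltage": a discrete Ornstein-Uhlenbeck-style
# recursion, which wanders but stays bounded in the long run.
v = np.empty(n)
v[0] = 0.0
for k in range(1, n):
    v[k] = 0.999 * v[k - 1] + rng.normal(scale=0.05)

i = np.gradient(v, dt)   # the "current" through a 1 F capacitor: I = C dV/dt

print("corr(V, I), bounded random signal:", np.corrcoef(v, i)[0, 1])   # close to zero

# Contrast: exponential decay, where the very same causal link (I = C dV/dt) shows up
# as perfect (anti-)correlation, as in the exponential example above.
v_exp = np.exp(-0.5 * t)
print("corr(V, I), exponential decay:", np.corrcoef(v_exp, np.gradient(v_exp, dt))[0, 1])   # close to -1
```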
For every control system that works well enough to be considered a control system at all, the correlation will totally drown in the noise. It will be unmeasurably small, and no investigation of the system using statistical techniques can succeed if it is based on the assumption that causation must produce correlation.
For example, take the simple domestic room thermostat, which turns the heating full on when the temperature is some small delta below the set point, and off when it reaches delta above. To a first approximation, when on, the temperature ramps up linearly, and when off it ramps down linearly. A graph of power output against room temperature will consist of two parallel lines, each traversed at constant velocity. As the ambient temperature outside the room varies, the proportion of time spent in the on state will correspondingly vary. This is the only substantial correlation present in the system, and it is between two variables with no direct causal connection. Neither variable will correlate with the temperature inside. The temperature inside, averaged over many cycles, will be exactly at the set point.
It’s only when this control system is close to the limits of its operation (too high or too low an ambient outside temperature) that any measurable correlation develops, because the approximation of the temperature ramp as linear breaks down. The correlation is a symptom of its incipient lack of control.
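Here is a minimal simulation of the thermostat just described (my own sketch; the heat-loss coefficient, heater power, switching band, and the slow swing in outdoor temperature are all arbitrary choices):

```python
import numpy as np

n, dt = 200_000, 1.0            # 200,000 one-second steps
set_point, delta = 20.0, 0.5    # heating switches at set_point ± delta
k = 0.0005                      # heat-loss coefficient, per second
heater_gain = 0.02              # °C per second added while the heating is on

# Outdoor temperature swings slowly between 0 °C and 10 °C, always below the set point.
outdoor = 5.0 + 5.0 * np.sin(2 * np.pi * 3 * np.arange(n) / n)

indoor = np.empty(n)
heater = np.zeros(n)
indoor[0], on = set_point, False
for i in range(1, n):
    if indoor[i - 1] < set_point - delta:
        on = True
    elif indoor[i - 1] > set_point + delta:
        on = False
    heater[i] = 1.0 if on else 0.0
    indoor[i] = indoor[i - 1] + dt * (k * (outdoor[i] - indoor[i - 1]) + heater_gain * heater[i])

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("corr(heater power, indoor temp) :", corr(heater, indoor))   # close to zero
print("corr(indoor temp, outdoor temp) :", corr(indoor, outdoor))  # close to zero
print("corr(heater power, outdoor temp):", corr(heater, outdoor))  # negative

# "Proportion of time spent in the on state": average the heater state over blocks of
# several duty cycles and compare it with outdoor temperature over the same blocks.
block = 2000
duty = heater[: n - n % block].reshape(-1, block).mean(axis=1)
out_b = outdoor[: n - n % block].reshape(-1, block).mean(axis=1)
print("corr(duty cycle, outdoor temp)  :", corr(duty, out_b))      # strongly negative
```

The only substantial correlation that emerges is the one between the proportion of on-time and the outdoor temperature, the two variables with no direct causal link, as claimed above.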
Knowledge of future inputs does not necessarily allow improved control. The room thermostat (assuming the sensing element and the heat sources have been sensibly located) keeps the temperature within delta of the set point, and could not do any better given any information beyond what it has, i.e. the actual temperature in the room. It is quite non-trivial to improve on a well-designed controller that senses nothing but the variable it controls.
Exponential decay is a very very ordinary process to find a capacitor in. Most capacitors are not in feedback control systems.
The capacitor is just a didactic example. Connect it across a laboratory power supply and twiddle the voltage up and down, and you get uncorrelated voltage and current signals.
Somewhere at home I have a gadget for using a computer as a signal generator and oscilloscope. I must try this.
On the other hand, I’d guess that 99% of actual capacitors are the gates of digital FETs (simply due to the mindbogglingly large number of FETs). Given just a moment’s glimpse of the current through such a capacitor, you can deduce quite a bit about its voltage.
False. Here (second graph) is an example of a real-life thermostat whose correlation does not drown in the noise: the correlation between inside and outside temperatures is evident when the outside temperature varies.
The thermostat isn’t actually doing anything in those graphs from about 7am to 4pm. There’s just a brief burst of heat to pump the temperature up in the early morning and a brief burst of cooling in the late afternoon. Of course the indoor temperature will be heavily influenced by the outdoor temperature. It’s being allowed to vary by more than 4 degrees C.
OK, maybe I misunderstood your original point.
I wonder why EY didn’t make an example of that in Stuff That Makes Stuff Happen.
Examples like the ones I gave are not to be found in Pearl, and hardly at all in the causal analysis literature.
Sorry, can you clarify what you mean by “like the ones”? What is the distinguishing feature?
Dynamical dependencies—one variable depending on the derivative or integral of another. (Dealing with these by discretising time and replacing every variable X by an infinite series X0,X1,X2… does not, I believe, yield any useful analysis.) The result is that correlations associated with direct causal links can be exactly zero, yet not in a way that can be described as cancellation of multiple dependencies. The problem is exacerbated when there are also cyclic dependencies.
There has been some work on causal analysis of dynamical systems with feedback, but there are serious obstacles to existing approaches, which I discuss in a paper I’m currently trying to get published.
Sorry, confused. A function is not always uncorrelated with its derivative. Correlation is a measure of co-linearity, not co-dependence. Do you have any examples where statistical dependence does not imply causality without a faithfulness violation? Would you mind maybe sending me a preprint?
edit to express what I meant better: “Do you have any examples where lack of statistical dependence coexists with causality, and this happens without path cancellations?”
I omitted some details, crucially that the function be bounded. If it is, then the long-term correlation with its derivative tends to zero, providing only that it’s well-behaved enough for the correlation to be defined. Alternatively, for a finite interval, the correlation is zero if it has the same value at the beginning and the end. This is pretty much immediate from the fact that the integral of x(dx/dt) is (x^2)/2. A similar result holds for time series, the proof proceeding from the discrete analogue of that formula, (x+y)(x-y) = x^2-y^2.
To put that more concretely, if in the long term you’re getting neither richer nor poorer, then there will be no correlation between monthly average bank balance and net monthly income.
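A tiny numeric check of that bank-balance statement (my own sketch; note that it takes each month’s average balance to be the midpoint of its opening and closing balances, which is what makes the (x+y)(x-y) identity apply exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
n_months = 240

# Hypothetical balances at each month boundary, nudged so the final balance equals the
# starting one: neither richer nor poorer over the whole period.
b = 1000.0 + np.cumsum(rng.normal(scale=300.0, size=n_months + 1))
b -= np.linspace(0.0, b[-1] - b[0], n_months + 1)

income = np.diff(b)                  # net income in each month
avg_balance = (b[1:] + b[:-1]) / 2   # average balance over each month (midpoint)

print(np.corrcoef(avg_balance, income)[0, 1])   # zero, up to floating-point rounding
```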
Don’t you mean causality not implying statistical dependence, which is what these examples have been showing? That pretty much is the faithfulness assumption, so of course faithfulness is violated by the systems I’ve mentioned, where causal links are associated with zero correlation. In some cases, if the system is sampled on a timescale longer than its settling time, causal links are associated not only with zero product-moment correlation, but zero mutual information of any sort.
Statistical dependence does imply that somewhere there is causality (considering identity a degenerate case of causality—when X, Y, and Z are independent, X+Y correlates with X+Z). The causality, however, need not be in the same place as the dependence.
Certainly, I’d be happy to send a preprint. Is this web page current for your email address?
That’s right, sorry: I meant causality not implying statistical dependence.
I had gotten the impression that you thought causal systems in which things are related to derivatives or integrals of other things introduce a case where this happens, and that it is not due to “cancellations” but to something else. From my point of view, correlation is not a very interesting measure: it’s a holdover from simple parametric statistical models that gets applied far beyond its actual capability.
People misuse simple regression models in the same way. For example, if you use linear causal regressions, direct effects are just regression coefficients. But as soon as you start using interaction terms, this stops being true (but people still try to use coefficients in these cases...)
Yes, the Harvard address still works.
I just noticed your edit:
The capacitor example is one: there is one causal arrow, so no multiple paths that could cancel, and no loops. The arrow could run in either direction, depending on whether the power supply is set up to generate a voltage or a current.
Of course, I is by definition proportional to dV/dt, and this is discoverable by looking at the short-term transient behaviour. But sampled on a timescale longer than those transients you just get a sequence of i.i.d. pairs, with no detectable dependence between V and I.
For cyclic graphs, I’m not sure how “path cancellation” is defined, if it is at all. The generic causal graph of the archetypal control system has arrows D --> P --> O and R --> O --> P, there being a cycle between P and O. The four variables are the Disturbance, the Perception, the Output, and the Reference.
If P = O+D, O is proportional to the integral of R-P, R = zero, and D is a signal varying generally on a time scale slower than the settling time of the loop, then O has a correlation with D close to −1, and O and D have correlations with P close to zero.
There are only two parameters, the settling time of the loop and the timescale of variations in D. So long as the former is substantially less than the latter, these correlations are unchanged.
Would you consider this an example of path cancellation? If so, what are the paths, and what excludes this system from the scope of theorems about faithfulness violations having measure zero? Not being a DAG is one reason, of course, but have any such theorems been extended to at least some class of cyclic graphs?
Addendum:
When D is a source with a long-term Gaussian distribution, the statistics of the system are multivariate Gaussian, so correlation coefficients capture the entire statistical dependence. Following your suggestion about non-parametric dependence tests I’ve run simulations in which D instead makes random transitions between +/- 1, and calculated statistics such as Kendall’s tau, but the general pattern is much the same. The controller takes time to respond to the sudden transitions, which allows the zero correlations to turn into weak ones, but that only happens because the controller is failing to control at those moments. The better the controller works, the smaller the correlation of P with O or D.
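For what it’s worth, here is a minimal sketch of that kind of simulation, using the ±1 disturbance described above (my own; the gain, step size, and transition rate of D are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, dt, gain = 50_000, 0.01, 20.0   # loop settling time ~ 1/gain = 0.05 s
R = 0.0                            # reference

# Disturbance: random transitions between +1 and -1, slow relative to the settling time.
flips = rng.random(n) < 0.001
D = np.where(np.cumsum(flips) % 2 == 0, 1.0, -1.0)

O = np.zeros(n)                    # output
P = np.zeros(n)                    # perception
P[0] = O[0] + D[0]
for t in range(1, n):
    O[t] = O[t - 1] + dt * gain * (R - P[t - 1])   # output integrates the error R - P
    P[t] = O[t] + D[t]                             # perception = output + disturbance

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("corr(O, D) =", corr(O, D))   # close to -1
print("corr(P, D) =", corr(P, D))   # weak: only the brief post-transition transients contribute
print("corr(P, O) =", corr(P, O))   # close to zero
```

Raising the gain (a faster, better controller) shrinks the already weak correlations of P with O and D still further, which is the pattern described above.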
I’ve also realised that “non-parametric statistics” is a subject like the biology of non-elephants, or the physics of non-linear systems. Shannon mutual information sounds in theory like the best possible measure, but for continuous quantities I can get anything from zero to perfect prediction of one variable from the other just by choosing a suitable bin size for the data. No statistical conclusions without statistical assumptions.
Dear Richard,
I have not forgotten about your paper, I am just extremely busy until early March. Three quick comments though:
(a) People have viewed cyclic models as defining a stable distribution in an appropriate Markov chain. There are some complications, and it seems with cyclic models (unlike the DAG case) the graph which predicts what happens after an intervention, and the graph which represents the independence structure of the equilibrium distribution are not the same graph (this is another reason to treat the statistical and causal graphical models separately). See Richardson and Lauritzen’s chain graph paper for a simple 4 node example of this.
So when we say there is a faithfulness violation, we have to make sure we are talking about the right graph representing the right distribution.
(b) In general I view a derivative not as a node, but as an effect. So e.g. in a linear model:
y = f(x) = ax + e
dy/dx = a = E[y|do(x=1)] - E[y|do(x=0)], which is just the causal effect of x on y on the mean difference scale.
In general, the partial derivative of the outcome wrt some treatment holding the other treatments constant is a kind of direct causal effect. So viewed through that lens it is not perhaps so surprising that x and dy/dx are independent. After all, the direct effect/derivative is a function of p(y|do(x),do(other parents of y)), and we know do(.) cuts incoming arcs to y, so the distribution p(y|do(x),do(other parents of y)) is independent of p(x) by construction.
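A small simulated check of that identity under the simple linear structural model above (my own sketch; the coefficient and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
a, n = 2.5, 1_000_000   # arbitrary coefficient and sample size

def y_given_do_x(x_value):
    """Structural equation y = a*x + e, with x fixed by intervention and e drawn afresh."""
    e = rng.normal(size=n)
    return a * x_value + e

effect = y_given_do_x(1.0).mean() - y_given_do_x(0.0).mean()
print("E[y|do(x=1)] - E[y|do(x=0)] =", effect)   # close to a = 2.5, i.e. dy/dx
```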
But that is more an explanation of why derivatives sensibly represent interventional effects than of whether there is something more to the observation (I think there might be). I do feel that Newton’s intuition in developing derivatives was to formalize, as a limit, “wiggle the independent variable and see what happens to the dependent variable”, which is precisely the causal effect. He was concerned with physical systems, too, where causality is fairly clear.
In general, p(y) and any function of p(y | do(x)) are not independent of course.
(c) I think you define a causal model in terms of the Markov factorization, which I disagree with. The Markov factorization
p(x1, …, xn) = ∏_i p(xi | pa(xi))
defines a statistical model. To define a causal model you essentially need to formally state that parents of every node are that node’s direct causes. Usually people use the truncated factorization (g-formula) to do this. See, e.g. chapter 1 in Pearl’s book.
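For reference, the truncated factorization referred to here sets the intervened variable to its assigned value and deletes that variable’s own factor from the product (my paraphrase of the standard formula, in the same notation as above):

p(x1, …, xn | do(xj = x*)) = ∏_{i ≠ j} p(xi | pa(xi)) for configurations with xj = x*, and 0 otherwise.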
I think causation with almost no correlation also happens with acyclic graphs: suppose you have an arrow from “eXercising” to “Eating a lot”, one from “Eating a lot” to “gaining Weight”, and one from “eXercising” to “gaining Weight”, and P(X) = 0.5, P(E|X) = 0.99, P(E|~X) = 0.01, P(W|X,E) = 0.5, P(W|X,~E) = 0.01, P(W|~X,E) = 0.99, P(W|~X,~E) = 0.5. Then W would be nearly uncorrelated with X (P(W|X) = 0.4951, P(W|~X) = 0.5049) and nearly uncorrelated with E (P(W|E) = 0.5049, P(W|~E) = 0.4951, if I did the maths right), but it doesn’t mean it isn’t caused by either.
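A short check of those conditionals (my own; the numbers are the hypothetical ones given in the comment above):

```python
# P(X), P(E|X), and P(W|X,E) as given above.
P_X = 0.5
P_E_given_X = {True: 0.99, False: 0.01}
P_W_given_XE = {(True, True): 0.5, (True, False): 0.01,
                (False, True): 0.99, (False, False): 0.5}

def p_w_given_x(x):
    # Marginalise E out of P(W | X, E) using P(E | X).
    return sum(P_W_given_XE[(x, e)] * (P_E_given_X[x] if e else 1 - P_E_given_X[x])
               for e in (True, False))

def p_w_given_e(e):
    # Condition on E by summing over X, weighted by P(X) and P(E | X).
    num = den = 0.0
    for x in (True, False):
        px = P_X if x else 1 - P_X
        pe = P_E_given_X[x] if e else 1 - P_E_given_X[x]
        num += P_W_given_XE[(x, e)] * pe * px
        den += pe * px
    return num / den

print(p_w_given_x(True), p_w_given_x(False))   # 0.4951, 0.5049
print(p_w_given_e(True), p_w_given_e(False))   # 0.5049, 0.4951
```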
Yes, this is the mechanism of cancellation of multiple causal paths. In theory one can prove, with assumptions akin to the ideal point masses and inextensible strings of physics exercises, that the probability of exact cancellation is zero; in practice, finite sample sizes mean that cancellation cannot necessarily be excluded.
And then to complicate that example, consider a professional boxer who is trying to maintain his weight just below the top of a given competition band. You then have additional causal arrows back from Weight to both eXercise and Eating. As long as he succeeds in controlling his weight, it won’t correlate with exercise or eating.