Causation, Probability and Objectivity
Most people here seem to endorse the following two claims:
1. Probability is “in the mind,” i.e., probability claims are true only in relation to some prior distribution and set of information to be conditionalized on;
2. Causality is to be cashed out in terms of probability distributions à la Judea Pearl or something.
However, these two claims feel in tension to me, since they appear to have the consequence that causality is also “in the mind”: whether something caused something else depends on various probability distributions, which in turn depend on how much we know about the situation. Worse, it has the consequence that ideal Bayesian reasoners can never be wrong about causal relations, since they always have perfect knowledge of their own probabilities.
Since I don’t understand Pearl’s model of causality very well, I may be missing something fundamental, so this is more of a question than an argument.
Pearl takes causality to be primitive, not something to be defined in terms of probabilities. See, for example, “Bayesianism and causality, or, why I am only a half-Bayesian”. A basic principle of his methods is that without causal assumptions, no causal conclusions can be obtained: “one cannot substantiate causal claims from associations alone, even at the population level—behind every causal conclusion there must lie some causal assumption that is not testable in observational studies”. Such assumptions can be domain-specific knowledge, or they can be general assumptions, such as that the true causal relationships form a DAG possessing the Markov and faithfulness properties.
It should also be noted that Pearl is not God, only a Turing Award winner. :) It appears that there are disputes within the statistical profession over his methods. I’m not informed on this, but see here for a discussion I came across while trying to track down the quote above.
In the referenced paper, Pearl writes:
Really? Has no one made any progress on this? I would think it would be a fairly straightforward application of comparing the entropy of f(y|x) versus f(x|y), and preferring the model with minimal entropy. I’d expect this to work because causal relations will in general be many to one, so that the causal model gives a tight effect, while the anticausal model would have a spread entropy covering the multiple causes for the effect. When a relation is one to one, then either model suffices for accurate predictions, and I don’t need to care.
I’d doubt that a brain, or the mathematics to describe it, would need more than this. We call x a sufficient cause of y if f(y|x) satisfies some condition on its entropy.
I agree with Pearl about the wonders of baking in our causal knowledge in terms of our choice of functions in a networked representation, but only see that as injecting our prior knowledge of the entropy of the conditional distributions above.
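To make the proposed entropy comparison concrete, here is a minimal sketch. It assumes a toy many-to-one mechanism (y = x // 2 with x uniform on four values), which is a hypothetical example and not taken from Pearl's paper. It only illustrates the heuristic described above; it is not an established causal-discovery method, and with more variables in play the inequality need not point the same way.

```python
import math
from collections import defaultdict

def conditional_entropy(joint, given_index):
    """Entropy in bits of the other variable given the one at given_index.

    joint maps (x, y) pairs to probabilities; given_index=0 yields H(Y|X),
    given_index=1 yields H(X|Y).
    """
    marginal = defaultdict(float)
    for pair, p in joint.items():
        marginal[pair[given_index]] += p
    h = 0.0
    for pair, p in joint.items():
        if p > 0:
            h -= p * math.log2(p / marginal[pair[given_index]])
    return h

# Hypothetical many-to-one mechanism: y = x // 2, x uniform on {0, 1, 2, 3}.
joint = {(x, x // 2): 0.25 for x in range(4)}

h_y_given_x = conditional_entropy(joint, given_index=0)  # "causal" direction
h_x_given_y = conditional_entropy(joint, given_index=1)  # "anticausal" direction

print(f"H(Y|X) = {h_y_given_x:.3f} bits, H(X|Y) = {h_x_given_y:.3f} bits")
```

In this toy case the forward conditional is tight (zero entropy) while the reverse one is spread over the two possible causes of each effect (one bit).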
I haven’t followed the literature for years. Does anyone know where this issue stands?
“Really? Has no one made any progress on this?”
(Interventionist) causality is not about probability, it is about responses to hypothetical interventions. Probability is just there to model uncertainty, it is not at all needed (in fact Pearl’s first definition of causal models is deterministic).
I think it is also a fair claim that “causality is in the mind,” since there does not seem to be any causality in quantum mechanics.
You can use probabilistic models to predict the result of interventions without ever using the word cause.
A deterministic y=f(x) is mathematically just a limiting case of a conditional f(y|x).
I haven’t kept up with the literature for a while, but my PhD was predominantly about embedding causal forward models in a probabilistic framework, and using the network for inference. I was reading both Jaynes and Pearl at the time. The above is always how I considered the relationship between causal models and probabilistic models, and I didn’t run into situations where such a formulation ran into problems.
Interventions do introduce a new variable into an observational model, the intervening action, so one should not be surprised that the observational model may need adjustment when being conditioned on information that was false (the intervention) during the observational period.
I would be interested to hear about how causality and the arrow of time are dealt with in quantum theory, and whether it requires anything more than probabilistic notation. If, as you say, they don’t require some special notions of causality, I’d take it that Hume wins again.
This or something similar is the starting point for most approaches to causality, but in general there are going to be many factors having a causal relationship with each variable in your model, and so there are plenty of opportunities for the inequality relating f(y|x) and f(x|y) to switch sign. I haven’t done much stuff with causality, though, so take this with a grain of salt. Here is a recent paper in the subject, if you’re interested.
EDIT: I guess what I’m really trying to say is that x may only have a causal influence on y if a bunch of other factors are present, so it can be hard to tell what’s going on just from your graphical model. I’m substantially less confident than 15 seconds ago that this comment makes sense, though.
Which can be represented in a straightforward fashion in Jaynes’s notation.
f(y | x0, x1=C… xN=C2)
If x “is a cause” of y when x1...xN, then this conditional will accurately predict y without ever saying “cause”. The causal talk seems to me superfluous mathematically—it’s just describing limiting cases of conditionals.
If you literally think that conditional probabilities describe causation, then you should water your grass to make it rain (because p(rain | grass-is-wet) is higher than p(rain | grass-is-dry)). Causation is not about prediction.
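The gap between conditioning and intervening can be checked by exact enumeration. The sketch below assumes a toy model in which rain and an independent sprinkler each make the grass wet; the probabilities 0.3 and 0.2 and all function names are made up for illustration.

```python
from itertools import product

P_RAIN = 0.3       # assumed probability of rain
P_SPRINKLER = 0.2  # assumed probability that the sprinkler ran

def joint(do_wet=None):
    """Yield (rain, wet, probability); do_wet overrides the wetness mechanism."""
    for rain, sprinkler in product([0, 1], repeat=2):
        p = (P_RAIN if rain else 1 - P_RAIN) * (
            P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
        )
        wet = int(rain or sprinkler) if do_wet is None else do_wet
        yield rain, wet, p

def p_rain_given_wet(wet_value, do_wet=None):
    """P(rain = 1 | wet = wet_value), optionally under the intervention do(wet)."""
    rows = [(rain, p) for rain, wet, p in joint(do_wet) if wet == wet_value]
    total = sum(p for _, p in rows)
    return sum(p for rain, p in rows if rain) / total

observed_wet = p_rain_given_wet(1)           # seeing wet grass
observed_dry = p_rain_given_wet(0)           # seeing dry grass
forced_wet = p_rain_given_wet(1, do_wet=1)   # making the grass wet ourselves

print(observed_wet, observed_dry, forced_wet)
```

Seeing wet grass raises the probability of rain above its prior, but forcing the grass to be wet leaves it at the prior of 0.3, because the intervention cuts the arrow from rain to wetness.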
I’m only starting to get into this stuff, so I don’t have an answer, only some more references.
Here is chapter 11 of Pearl’s Book, consisting of his 2009 responses to and discussions with readers, which begins with a strong defence of the necessity of separating causal and statistical concepts. Here is a later state of the Pearl/Rubin discussion on Gelman’s blog, with links to earlier instalments.
Here’s the short version:
Question: we want to estimate a causal effect of X on Y from observational data, but we have confounding variables we observe. What variables do we adjust for to get an unbiased estimate of the causal effect?
Rubin: All of them (we should condition on all available data, so we don’t waste information).
Pearl: those and only those which block back-door paths but not causal paths in the graph.
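For reference, the standard back-door adjustment that Pearl's answer points to can be written as below. Here Z is a set of observed covariates assumed to block every back-door path from X to Y and to contain no descendant of X; the symbol Z is introduced only for this illustration.

```latex
p(y \mid \mathrm{do}(x)) \;=\; \sum_{z} p(y \mid x, z)\, p(z)
```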
I think what is going on is there are two separate issues here. Pearl is talking about an identification issue—what functional represents causal effects in an unbiased way. Rubin is talking about an estimation issue—we should use all available information to reduce uncertainty in our estimate. Pearl is talking about bias, Rubin is talking about variance.
In my view, the “right answer” is that if we want the effect of X on Y, we have to both:
(a) Use all available information (the functional for the effect is a function of all variables ancestral of Y not through X).
(b) Use all available information in the “right way” to avoid bias. That is, we don’t just want to condition on a particular ancestor of Y, we may have to do more complex things to avoid bias.
Here’s a paper we wrote that gives an unbiased maximum likelihood estimator for all identifiable causal effects in discrete models with hidden variables: http://arxiv.org/pdf/1202.3763.pdf. Because the estimate is an MLE it uses all information like Rubin wants. Because the estimate is unbiased, Pearl should be happy as well.
By the way, “M-bias” refers to a situation where we observe a variable that correlates with both X and Y but is not an ancestor of Y not through X. Simplest graph: X → Y <-> W <-> X. In this case, the right thing to do is to not condition on W, or indeed use W in any way when estimating p(y | do(x)). The MLE for p(y | do(x)) does not use W, so we don’t lose information by ignoring W. So in this particular case, Pearl is right to worry about bias when conditioning on W, and Rubin is wrong to worry about missing information when not conditioning on W (there is no information to miss).
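The M-bias claim can be checked numerically. The sketch below is a hypothetical binary version of the graph, with the two bidirected edges replaced by hidden fair coins u1 (a common cause of X and W) and u2 (a common cause of W and Y). All parameter values are invented for illustration. It computes everything by exact enumeration, so there is no sampling noise.

```python
from itertools import product

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given 0/1 value."""
    return p if value else 1 - p

def joint(do_x=None):
    """Yield (u1, u2, x, w, y, prob) for the assumed M-bias model."""
    for u1, u2, x, w, y in product([0, 1], repeat=5):
        p = 0.25  # u1 and u2 are independent fair coins
        if do_x is None:
            p *= bern(0.8 if u1 else 0.2, x)   # u1 -> x
        elif x != do_x:
            continue                           # intervention pins x
        p *= bern(0.1 + 0.4 * u1 + 0.4 * u2, w)  # u1 -> w <- u2
        p *= bern(0.1 + 0.3 * x + 0.5 * u2, y)   # x -> y <- u2
        yield u1, u2, x, w, y, p

def prob(event, given=lambda r: True, do_x=None):
    """P(event | given), optionally in the model mutilated by do(x)."""
    rows = [r for r in joint(do_x) if given(r)]
    total = sum(r[-1] for r in rows)
    return sum(r[-1] for r in rows if event(r)) / total

def y_is_1(r):
    return r[4] == 1

def p_y_do(x):
    return prob(y_is_1, do_x=x)

def p_y_given(x):
    return prob(y_is_1, given=lambda r: r[2] == x)

def p_y_adjusted_for_w(x):
    """Adjust for w as if it were a confounder: sum_w p(y | x, w) p(w)."""
    return sum(
        prob(y_is_1, given=lambda r: r[2] == x and r[3] == w)
        * prob(lambda r: r[3] == w)
        for w in (0, 1)
    )

true_effect = p_y_do(1) - p_y_do(0)
naive_effect = p_y_given(1) - p_y_given(0)
adjusted_effect = p_y_adjusted_for_w(1) - p_y_adjusted_for_w(0)

print(true_effect, naive_effect, adjusted_effect)
```

Under these assumptions the unadjusted contrast matches the interventional one, while adjusting for W pulls the estimate away from it, because conditioning on W makes u1 and u2 dependent and so opens a non-causal path between X and Y.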
“All of them” obviously cannot be literally true, because, for instance, we don’t want to condition on the future of Y even if we observe it (the future is just the noisy sensor version of the present, it carries no extra information, just extra randomness).
From your description, it seems that Rubin wants to predict what happens in the world, and Pearl insists on asking and answering questions about what happens in the world in terms of causal language.
What’s the simplest prediction of what happens in the world that Pearl would claim Rubin cannot accurately make?
If there is no such limitation in Rubin’s approach, we’re arguing convenient notation. My preference lies with the most general notation, with the least amount of special case jargon, so I likely will be on Rubin’s side.
Pearl likes graphs, but graphs are just a mathematical aid. What he and Rubin are talking about is not “about” graphs. You can prove all the theorems without graphs. Both Pearl and Rubin are talking about potential outcomes (interventionist view of causality). Pearl uses a model which makes cross-world independence assumptions (Rubin probably does not, although I have not asked him. Of course Rubin loves “principal stratification” which as far as I understand is wildly untestable, so who really knows what he thinks. A lot of workers in the field do not like cross-world independences because they are not testable).
To the extent that Rubin wants to estimate potential outcome random variables from observational data, he HAS to agree with Pearl on pain of bias (i.e. garbage). In the example I gave, if Rubin insists on conditioning on W, he will get a garbage answer for the potential outcome Y(x). Identification of potential outcomes isn’t the kind of thing where you can have a difference of opinion. It’s like having an opinion on what 2 + 2 is.
From your description, you say that Rubin insists on conditioning on all available data, so that includes W. But that doesn’t mean he has to get garbage, that just means he needs the right conditional.
Let Jaynes notation do the work. The base problem seems to be:
You can assign probabilities using observational data to create P(X1...XN | Intervention=No). How do I use that model to assign P(X1...XN | Intervention=Yes)?
Do these guys have any case where they make different predictions of what will happen in an intervention? Or do they just dance around in their own languages and come up with the same predictions?
“From your description, you say that Rubin insists on conditioning on all available data, so that includes W. But that doesn’t mean he has to get garbage, that just means he needs the right conditional.”
The right expression for p(y | do(x)) in this example should ignore W, that’s all there is to it. It’s not a notational issue.
“You can assign probabilities using observational data to create P(X1...XN | Intervention=No). How do I use that model to assign P(X1...XN | Intervention=Yes)?”
Good question! The answer is to use something called the consistency assumption (I think Pearl might call it “composition” in his book). This states, roughly, that Y(X) = Y (that is, observing Y when there is no intervention is the same as observing Y when X is intervened to attain whatever value it would naturally attain). This assumption is untestable, but to my knowledge every single paper in causal inference makes this assumption in some form. Without something like this assumption there is no link between the data we observe and the data after a hypothetical intervention.
I think the kinds of examples that are drastically biased given Rubin’s “condition on everything” policy are not very common in practical data analysis problems, but it’s certainly easy to construct them. While I have not asked him, I suspect that if I were to put a gun to Rubin’s head and give him the above example, he would admit to not adjusting on W (and then say the situations in the example never happen in practice).
My view: M-bias is a special case of a more general issue where conditioning opens paths (due to how d-separation works in graphs). The way this issue manifests in practice is people assume they observe all confounders, adjust for them, get an estimate, and call it a day. In practice, their assumption is wrong: adjusting for all observable confounders opens a bunch of non-causal paths due to the inevitable presence of hidden variables, and the estimate they get is biased for this reason. There is, however, some evidence that this bias is sometimes not very big (I think Sander Greenland did some work on this).
I expect that your concern doesn’t really have to do with causality and is already there in “probability is in the mind.” But I’m going to argue that causality is in the mind.
Causality is in the model. Let’s say that we work with billiard ball physics of gas. Since the model is reversible, it has no arrow of time (or maybe it has both arrows of time). In fact, I might say that the “real thing” is a timeless view and choosing to slice by time is a modeling choice. If we know that the gas is concentrated in half of the room at time zero (a probability distribution over microstates), we have a forward arrow of causality. But talking about what we know is about the mind. This assumption leads to a backwards arrow during negative time, but that’s OK, because the model is wrong for negative time: we put the gas in half of the room, violating the model.
Thermodynamics is in the mind. The causality of thermodynamics is in the mind. Thermodynamics is a good model because it is about things we can really measure, not microstates that we cannot; it is a good model because it matches our ignorance, which is in our minds.
An ideal reasoner (e.g., AIXI) would not use thermodynamics but would work directly with microstates, though this is not practical if the reasoner is in the world. But AIXI does have beliefs about the underlying model of the universe, which can be wrong.
I don’t understand where the tension is supposed to come in. The idea that causation is in the mind, not in the world, is part of the Humean tradition and has been a respected (although minority) position in philosophy for centuries. If anything, it seems to mesh particularly well with empiricist-leaning philosophies (especially those with an anti-metaphysical stance).
It just seems really weird to be able to correctly say that A caused B when, in fact, A had nothing to do with B. If that doesn’t seem weird to you, then O.K.
I think that’s unclear; I side with those who think Hume was arguing for causal skepticism rather than some sort of subjectivism.
This point is completely independent of whether causation is “in the mind” or not. Also, correlated things do have something to do with each other (by definition!). What is at issue is whether this something is “out in the world” or “in your head”.
Right, there is probably no consensus on Humean interpretation. In any case, Hume would predict with near certainty that a billiard ball that was struck by a second billiard ball would make a sound and roll away in a regular manner, just the same as you would. But since he doesn’t need this “causal necessity” thing “out in the world” somewhere in order to coherently make the same prediction, your web-of-belief real estate seems to have lower rent than Hume’s.
“Causation is in the mind” does not imply “correlation is in the mind,” does it? I mean, assuming a deterministic interpretation of QM, causal determinism is pretty much a correct philosophical position. That means causality, in the Pearl sense, really is only in the mind. In the world, there are only interactions which happen according to mathematically regular rules.
You might as well talk about causality along the X-axis instead of the time axis: “the state of the universe at any point along the X axis can be known, with unlimited computing power and complete knowledge of any other Y,Z,T hyperplane.” If we were epistemically limited to a one-way view along the universe’s X-axis, and could see in both directions along the time axis, this would make sense.
Do you know Jon Williamson’s work? It seems to give an answer to your question (but I’ve not read it yet). Here’s the first paragraph of Section 9.1 “Mental yet Objective” of his book “Bayesian Nets and Causality”:
Here’s a link to his papers on causality. At least the fifth, “Causality”, contains an introduction to epistemic causality.
Nope, I wasn’t familiar. Very interesting, thanks!
That statement is too imprecise to capture Jaynes’s view of probability. He demonstrates (YMMV) that there is a unique way to assign probability to represent your degree of belief in propositions in a way that is consistent with certain desired properties of degrees of belief. That doesn’t make the probability assignment “true”, it just makes it consistent with your knowledge and the desired properties. In particular, it won’t make the probability distribution you assign match some ill-defined long-term frequency of some event occurring.
Of course; it wasn’t intended to capture the difference between so-called objective Bayesianism vs. subjective Bayesianism. The tension, if it arises at all, arises from any sort of Bayesianism. That the rules prescribed by Jaynes don’t pick out the “true” probability distributions on a certain question is compatible with probability claims like “It will probably rain tomorrow” having a truth-value.
I was pointing out that your original statement characterizing “most people here” as asserting that “probability claims are true …” is antithetical to Jaynes’s approach, which I take as the canonical, if not universal, view on this list.
I don’t see the relation between the two. It seems like you’re pointing out that Jaynes/people here don’t believe there are “objectively correct” probability distributions that rationality compels us to adopt. But this is compatible with there being true probability claims, given one’s own probability distribution—which is all that’s required.
There may be an objectively correct way to throw globs of paint at the wall if I wish to do it in a way that is consistent with certain desired properties given my state of knowledge. That would not make that correct way of throwing globs of paint “true”.
À la Jaynes, there is a correct way to assign degrees of belief based on your state of knowledge if you want your degrees of belief to be consistent with certain constraints, but that doesn’t make any particular probability assignment “true”. Probability assignments don’t have truth value, they assign degrees of belief to propositions that do have truth value. It is a category error, under Jaynes’s perspective, to assert that a probability assignment is “true”, or purple, or hairy, or smelly.
Sure they do. If you’re a Bayesian, an agent truly asserts that the (or, better, his) probability of a claim is X iff his degree of belief in the claim is X, however you want to cash out “degree of belief”. Of course, there are other questions about the “normatively correct” degrees of belief that anyone in the agent’s position should possess, and maybe those lack determinate truth-value.
If I scratch my nose, that action has no truth value. No color either.
The proposition “I scratched my nose” does have a truth value.
See the distinction. Don’t hand wave it with “it’s all the same”, “that’s just semantics”, etc. You started saying that this is more of a question. I’ve tried to clarify the answer to you.
Bayesian epistemology maintains that probability is degree of belief. Assertions of probabilities are therefore assertions of degrees of belief, which are psychological claims and therefore obviously have or can have truth-value. Of course, Bayesians can be more nuanced and take some probability claims to be about degrees of belief in the minds of some idealized reasoner; but “the degree of belief of an idealized reasoner would be X given such-and-such” is still truth-evaluable.
The question was primarily about the role of probability in Pearl’s account of causality, not the basic meaning of probability in Bayesian epistemology.
Hm, good point.
I think the oddness is because good Bayesians usually don’t treat the real world as probabilistic, but a Pearlian model of the world has inherent probabilities. Ideal reasoners can’t be automatically right about causal relations, and that’s because probability is in two different “places” in the two models. The ideal reasoner is automatically right about its probabilities, but it isn’t automatically right about these inherent probabilities in the territory. There’s no problem with the probability not corresponding exactly to the world, because that’s usually how it is: probabilities are merely the best you can do.
So huh.
I said this once above, but it’s worth repeating—Pearl’s view of causality has nothing to do with probabilities. It’s a fully deterministic theory which can be augmented by modeling uncertainty via probability theory if you want.
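As a rough illustration of what a fully deterministic causal model with interventions can look like, here is a minimal sketch. The lamp example, the variable names, and the helper functions solve and do are all hypothetical, not Pearl's notation. No probabilities appear anywhere; an intervention simply swaps one mechanism for a constant.

```python
def solve(equations, exogenous):
    """Evaluate structural equations, which must be listed in causal order."""
    values = dict(exogenous)
    for name, mechanism in equations.items():
        values[name] = mechanism(values)
    return values

def do(equations, **fixed):
    """Copy the model, replacing each named mechanism with a constant."""
    mutilated = dict(equations)
    for name, value in fixed.items():
        mutilated[name] = lambda values, value=value: value
    return mutilated

# Hypothetical toy model: a switch and a power supply drive a lamp,
# and the lamp lights the room.
model = {
    "switch": lambda v: v["u_switch"],
    "lamp": lambda v: v["switch"] and v["u_power"],
    "room_lit": lambda v: v["lamp"],
}

world = {"u_switch": False, "u_power": True}  # fixed background conditions

actual = solve(model, world)
forced = solve(do(model, lamp=True), world)

print(actual)
print(forced)
```

Forcing the lamp on changes what is downstream of it (the room is lit) but leaves the switch where it was. Uncertainty could be layered on afterwards by putting a distribution over the background conditions.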
A causal model uniquely specifies a bunch of conditional probabilities, right?
Only in the sense that a first-order theory of natural numbers does. Really I think it is more accurate to view “a causal model” as a model in the mathematical logic sense: an object about which logical assertions can be made. In the case of causal models, these assertions are modelling “interventions.” Here’s a paper on this:
http://www.jair.org/papers/paper648.html
This view appears in Pearl’s chapter 7, as well.