(a) You don’t need to observe confounders to learn structure from data. In fact, sometimes you don’t need any standard conditional independence at all. (Luke gave me the impression SI wasn’t very interested in that point—maybe it should be).
(b) Occam’s razor / faithfulness gives you enough to learn the structure of statistical models, not causal ones. You need additional assumptions to equate the statistical models you learn with causal models. Bayesian networks are not causal models. Causality is not about conditional independence; it is about counterfactual invariance, that is, causality expresses what changes or stays the same after a hypothetical ‘wiggle.’
Even if Occam’s razor and faithfulness hold, there is no guarantee that the graph you obtain is such that if I wiggle a parent, the child will change. To verify your causal assumptions, you have to run an experiment, or no scientist will believe your graph is causal. This is what real causal discovery papers do, for example:
http://www.sciencemag.org/content/308/5721/523.abstract
Here they learned a protein signaling network, then implemented an experiment where they changed the protein level of a parent via an RNA molecule, and verified that the child changed but the parent of a parent did not.
I am sure you can set up a Bayesian story for this entire enterprise, if you wanted. But, firstly, this Bayesian story would not be expressed purely in probability theory but in a language that can express counterfactual invariance and talk about experiments (for example, the language of potential outcomes or do(.)). And secondly, giving something a Bayesian story is sort of equivalent to re-expressing some complicated program as a vi macro. Could be done (vi is Turing-complete!) but why? People don’t write practical code in vi macros.
This sounds like we’re talking past each other somehow. Your point (a) is not clear to me—I was saying that to learn a sufficiently high-probability causal model from non-intervention data, you need to have observed the data in sufficient detail to rule out confounders (except at some low probability) (via simplicity priors, which otherwise can’t drive down the probability of an untestable invisible confounder by all that far). This can certainly be done in principle, e.g. if you put the system under a microscope with a higher resolution than the system, and verified there were only X kinds of stuff in it and no others.
Your point (b) sounds just plain wrong to me. If you have a simplicity prior over causal models, and you can derive testable probable predictions from causal models, then you can do Bayesian updating and get a posterior over causal models. Substituting the word “flammable fizzbins” for “causal models” in the preceding sentence will produce another true sentence. I think you mean something different by “Bayesian” and “Occam’s Razor” than I do.
By (a) I mean that you can sometimes get the true graph exactly even without having to observe confounders. Actually this was sort of known already (see the FCI algorithm, or even the IC* algorithm in Pearl’s book), but we can do a lot better than that. For example, if we have the true graph:
a → b → c → d, with a ← u1 → c, and a ← u2 → d, where we do not observe u1,u2, and u1,u2 are very complicated, then we can figure out the true graph exactly by independence type techniques without having to observe u1 and u2. Note: the marginal distribution p(a,b,c,d) that came from this graph has no conditional independences at all (checkable by d-separation on a,b,c,d), so typical techniques fail.
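To make this checkable, here is a minimal sketch in Python (names are illustrative; it assumes a networkx version providing nx.d_separated, renamed is_d_separator in newer releases) that enumerates every pairwise conditional independence query over the observed margin and finds none:

```python
import itertools
import networkx as nx

# The true graph: a -> b -> c -> d, plus hidden confounders
# u1 -> {a, c} and u2 -> {a, d}.
g = nx.DiGraph([
    ("a", "b"), ("b", "c"), ("c", "d"),
    ("u1", "a"), ("u1", "c"),
    ("u2", "a"), ("u2", "d"),
])
observed = ["a", "b", "c", "d"]

# Enumerate every pairwise query "x _||_ y given z" over the observed
# margin, for every conditioning subset z of the remaining variables.
for x, y in itertools.combinations(observed, 2):
    rest = [v for v in observed if v not in (x, y)]
    for r in range(len(rest) + 1):
        for z in itertools.combinations(rest, r):
            if nx.d_separated(g, {x}, {y}, set(z)):
                print(x, "_||_", y, "|", set(z))
# Nothing is printed: the margin p(a,b,c,d) exhibits no conditional
# independences, so methods that search for them have nothing to use.
```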
(b) is I guess “a subtle issue”—but my point is about careful language use and keeping causal and statistical issues clear and separate.
A “Bayesian network” (or “belief network”—I don’t like the word Bayesian here because it confuses the issue; you can use frequentist techniques with belief networks if you wanted, and in fact a lot of folks do) is a joint distribution that factorizes as a DAG. That’s it. Nothing about causality. If there is a joint density representing a causal process where a is a direct cause of b is a direct cause of c, then this joint density will factorize with respect to both
a → b → c
and
a ← b ← c
but only the former graph is causal, the latter is not. Both graphs form a “Bayesian network” with the joint density (since the density factorizes with respect to both graphs), but only one graph is a causal graph. If you want to talk about causal models, in addition to saying that there is a Markov factorization you also need to say something else—something that makes parents into direct causes. Usually people say something like:
for every x, p(x | pa(x)) = p(x | do(pa(x))), or mention the g-formula, or the truncated factorization of do(.), or “the causal Markov condition.”
But this is something that (a) you need to say explicitly, and (b) involves language beyond standard probability theory because there is a do(.), and (c) is controversial to some people. What is do(.)? It refers to a hypothetical experiment/intervention.
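To make the two-factorizations point above concrete, here is a small numeric sketch with numpy (all names illustrative): a joint built from the a → b → c factorization also factorizes exactly as a ← b ← c, so the factorization alone cannot tell you which graph is causal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a joint p(a,b,c) that factorizes along the chain a -> b -> c:
# p(a,b,c) = p(a) p(b|a) p(c|b), binary variables throughout.
p_a = rng.dirichlet(np.ones(2))             # p(a)
p_b_given_a = rng.dirichlet(np.ones(2), 2)  # rows indexed by a
p_c_given_b = rng.dirichlet(np.ones(2), 2)  # rows indexed by b
joint = np.einsum("a,ab,bc->abc", p_a, p_b_given_a, p_c_given_b)

# Recover the pieces of the reversed factorization p(c) p(b|c) p(a|b).
p_c = joint.sum(axis=(0, 1))
p_b_given_c = joint.sum(axis=0) / p_c   # [b, c] = p(b|c)
p_ab = joint.sum(axis=2)                # [a, b] = p(a,b)
p_a_given_b = p_ab / p_ab.sum(axis=0)   # [a, b] = p(a|b)

reversed_joint = np.einsum("c,bc,ab->abc", p_c, p_b_given_c, p_a_given_b)
print(np.allclose(joint, reversed_joint))   # True: both graphs fit
```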
If all you are learning is a graph that gives you a Markov factorization you have no business making claims about interventions—interventions are a separate magisterium. You can assume that the unknown graph from which the data came is causal—but you need to say this explicitly, this assumption will be controversial to some people, and by making that assumption you are I think committing yourself to the use of interventionist/potential outcome language (just to describe what it means for a data generating graph to be causal).
I have no problems with you doing Bayesian updating and getting posteriors over causal models—I just wanted to get more precision on what a causal model is. A causal model is not a density factorizing with respect to a DAG—that’s a statistical model. A causal model makes assertions that relate hypothetical experiments like p(x | do(pa(x))) with observed data like p(x | pa(x)). So your Bayesian updating is operating in a world that contains more than just probability theory (which is a theory of standard joint densities, without the mention of do(.) or hypothetical experiments).
You can in fact augment probability theory with a logical description of interventions, see for example this paper:
http://www.jair.org/papers/paper648.html
If your notion of causal model does not relate do(.) to observed data, then I don’t know what you mean by a causal model. It’s certainly not what I mean by it.
Well, this is very rapidly getting us into complex territory that future decision-theory posts will hopefully explore, but a very brief answer would be that I am unwilling to define anything fundamental in terms of do() operations because our universe does not contain any do() operations, and counterfactuals are not allowed to be part of our fundamental ontology because nothing counterfactual actually exists and no counterfactual universes are ever observed. There are quarks and electrons, or rather amplitude distributions over joint quark and lepton fields; but there is no do() in physics.
Causality seems to exist, in the sense that the universe seems completely causally structured—there is causality in physics. On a microscopic level where no “experiments” ever take place and there are no uncertainties, the microfuture is still related to the micropast with a neighborhood-structure whose laws would yield a continuous analogue of D-separation if we became uncertain of any variables.
Counterfactuals are human hypothetical constructs built on top of high-level models of this actually-existing causality. Experiments do not perform actual interventions and access alternate counterfactual universes hanging alongside our own, they just connect hopefully-Markov random numbers into a particular causal arrow.
Another way of saying this is that a high-level causal model is more powerful than a high-level statistical model because it can induct and describe switches, as causal processes, which behave as though switching arrows around, and yield predictions for this new case even when the settings of the switches haven’t been observed before. This is a fancypants way of saying that a causal model lets you throw a bunch of rocks at trees, and then predict what happens when you throw rocks at a window for the first time.
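A toy sketch of the rocks-and-windows point, in Python (hypothetical mechanism and names; it assumes x is unconfounded, so the fitted conditional is the invariant mechanism): a model fit in one regime keeps predicting correctly under a setting, do(x = 5), never seen in training.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground-truth mechanism, unknown to the learner: y = 2x + noise.
def mechanism(x):
    return 2.0 * x + rng.normal(0.0, 0.5, size=x.shape)

# "Throwing rocks at trees": observational regime, x ~ N(0, 1).
x_obs = rng.normal(0.0, 1.0, size=10_000)
y_obs = mechanism(x_obs)

# Fit the mechanism p(y|x) from that regime (here, least squares).
slope, intercept = np.polyfit(x_obs, y_obs, deg=1)

# "Throwing rocks at a window": a setting never seen before, do(x = 5).
x_new = np.full(10_000, 5.0)
y_pred = slope * x_new + intercept

# The mechanism is invariant under the switch of regime, so the old fit
# predicts the new regime correctly.
y_new = mechanism(x_new)
print(y_pred.mean(), y_new.mean())   # both close to 10
```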
As an additional data point, I also still do not have a very good understanding of your ideas about causality (although I did note earlier that it seems rather different from Pearl’s (which are similar to Ilya’s)). I also note that nobody else seems to have a good understanding of your ideas, at least not enough to try to build upon them either here on LW or on the decision theory mailing list or try to explain them to me when I asked.
Interesting. Sorry to bother you further, but can I ask you to quote a particular sentence or paragraph above that seems unclear? Or was the above clear, but it implies other questions that aren’t clear, or the motivations aren’t clear?
As a third data point, I used to be very confused about your ideas about causality, but your recent writing has helped a lot. To make embarrassingly clear how very wrong I’ve been able to be, some years ago when you’d told us about TDT but not given details, I thought you had a fully worked-out and justified theory about how a decision agent could use causal graphs to model its uncertainty about the output of platonic computations, and use do() on its own output to compute the utility of different courses of action, and I got very frustrated when I simply couldn’t figure out how to fill in the details of that...
...hmm. (I should probably clarify: when I say “use causal graphs to reason about”, I don’t mean in the ‘trivial’ sense you are actually using where the platonic computations cause other things but are themselves uncaused in the model; I mean some sort of system where different computations and/or logical facts about computations form a non-degenerate graph, and where do() severs one node somewhere in the middle of that graph from its parents.) “And”, I was going to say, “when you finally did tell us more, I had a strong oh moment when you said that you still weren’t able to give a completely satisfying theory/justification, but were reasonably satisfied with the version you had. But I still continued to think that my picture of what you had been trying to do had been correct, only you didn’t have a fully worked-out theory of it, either.” The actual quote that turned into this memory of things seems to be,
Note that this does not solve the remaining open problems in TDT (though Nesov and Dai may have solved one such problem with their updateless decision theory). Also, although this theory goes into much more detail about how to compute its counterfactuals than classical CDT, there are still some visible incompletenesses when it comes to generating causal graphs that include the uncertain results of computations, computations dependent on other computations, computations uncertainly correlated to other computations, computations that reason abstractly about other computations without simulating them exactly, and so on.
But there’s also this:
The three-sentence version is: Factor your uncertainty over (impossible) possible worlds into a causal graph that includes nodes corresponding to the unknown outputs of known computations; condition on the known initial conditions of your decision computation to screen off factors influencing the decision-setup; compute the counterfactuals in your expected utility formula by surgery on the node representing the logical output of that computation.
And later:
Those of you who’ve read the quantum mechanics sequence can extrapolate from past experience that I’m not bluffing.
Huh. In retrospect I can see how this matches my current understanding of what you’re doing, but comparing this to what I wrote in the first paragraph above (before searching for that post), it’s actually surprisingly nonobvious where the difference is between what you wrote back then and what I wrote just now to explain the way in which I had horribly misunderstood you...
Anyway. As for what you wrote in the great-grandparent, I had to read it slowly, but most of it makes perfect sense to me; the last paragraph I’m not quite as sure about, but there too I think I understand what you mean.
There is, however, one major point on which I currently feel confused. You seem to be saying that causal reasoning should be seen as a very fundamental principle of epistemology, and on your list of open problems, you have “Better formalize hybrid of causal and mathematical inference.” But it seems to me that if you just do inference about logical uncertainty, and the mathematical object you happen to be interested in is a cellular automaton or the PDE giving the time evolution of some field theory, then your probability distribution over the state at different times will necessarily happen to factor in such a way that it can be represented as a causal model. So why treat causality as something fundamental in your epistemology, and then require deep thinking about how to integrate it with the rest of your reasoning system, rather than treating it as an efficient way to compress some probability distributions, which then just automatically happens to apply to the mathematical objects representing our actual physics? (At this point, I ask this question not as a criticism, but simply to illustrate my current confusion.)
So why treat causality as something fundamental in your epistemology, and then require deep thinking about how to integrate it with the rest of your reasoning system, rather than treating it as an efficient way to compress some probability distributions, which then just automatically happens to apply to the mathematical objects representing our actual physics?
Because causality is not about efficiently encoding anything. A causal process a → b → c is equally efficiently encoded via c → b → a.
But it seems to me that if you just do inference about logical uncertainty, and the mathematical object you happen to be interested in is a cellular automaton or the PDE giving the time evolution of some field theory, then your probability distribution over the state at different times will necessarily happen to factor in such a way that it can be represented as a causal model.
This is not true, for lots of reasons, one of them having to do with “observational equivalence.” A given causal graph has many different graphs with which it agrees on all observable constraints. All these other graphs are not causal. The 3 node chain above is one example.
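For instance, one can verify the observational equivalence of the three-node chains directly; a minimal sketch (same networkx assumption as above) comparing all orientations of the chain skeleton:

```python
import itertools
import networkx as nx

nodes = ["a", "b", "c"]

def independences(g):
    """All d-separation statements the graph implies over its nodes."""
    out = set()
    for x, y in itertools.combinations(nodes, 2):
        rest = [v for v in nodes if v not in (x, y)]
        for r in range(len(rest) + 1):
            for z in itertools.combinations(rest, r):
                if nx.d_separated(g, {x}, {y}, set(z)):
                    out.add((x, y, z))
    return out

chain = independences(nx.DiGraph([("a", "b"), ("b", "c")]))
candidates = {
    "a -> b -> c": [("a", "b"), ("b", "c")],
    "a <- b <- c": [("b", "a"), ("c", "b")],
    "a <- b -> c": [("b", "a"), ("b", "c")],
    "a -> b <- c": [("a", "b"), ("c", "b")],
}
for name, edges in candidates.items():
    same = independences(nx.DiGraph(edges)) == chain
    print(name, "observationally equivalent:", same)
# Only the collider a -> b <- c disagrees; the other three graphs are
# indistinguishable by observational constraints alone.
```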
Sorry, I understand the technical point about causal graphs you are referring to, but I do not understand the argument you’re trying to make with it in this context.
Suppose it’s the year 2100, and we have figured out the true underlying laws of physics, and it turns out that we run on a cellular automaton, and we have some very large and energy-intensive instruments that allow us to set up experiments where we can precisely set up the states of individual primitive cells. Now we want to use probabilistic reasoning to examine the time evolution of a cluster of such cells if we have only probabilistic information about the boundary conditions. Since this is a completely ordinary cellular automaton, we can describe it using a causal model, where the state of a cell at time t+1 is caused by its own state and the state of its neighbours at time t.
In this case, causality is really fundamentally there in the laws of physics (in a discrete analog of what we suspect for our actual laws of physics). And though you can’t reach in from the outside of the universe, it’s possible to imagine scenarios where you could do the equivalent of do() on some of the cells in your experiment, though it wouldn’t really be done by acausally changing what happens in the universe—one way to imagine it is that your experiment runs only in a two-dimensional slice surrounded by a “vacuum” of cells in a “zero” state, and you can reach in through that vacuum to change one of the cells in the two-dimensional grid.
But when it comes to how to model this inside a computer, it seems that you can reach all the conclusions you need by “ordinary” probabilistic reasoning: For example, you could start with, say, a uniform joint probability distribution over the state of all cells in your experiment at all times; then you condition on the fact that they fulfill the laws of physics, i.e. the time evolution rule of the cellular automaton; then you condition again on what you know about the boundary conditions, e.g. the fact that your experimental apparatus reaches in through the third dimension at some point to change the state of some cells. It’s extraordinarily inefficient to represent the joint distribution as a giant look-up table of probabilities, but I do not see which of the inferences you want would be lost by doing the calculations that way.
(All of this holds even if the true laws happen to be deterministic in only one direction in time, so that in your experiment you can distinguish a → b → c from c → b → a by reaching in through the third dimension at time b.)
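A brute-force sketch of the giant-lookup-table proposal above, in Python (the toy rule and names are hypothetical): enumerate all law-abiding trajectories of a tiny automaton under a uniform prior over initial conditions, condition on an observation, and read posteriors off by counting.

```python
import itertools

# A tiny deterministic universe: a 1-D, 5-cell cellular automaton with
# periodic boundary and a majority update rule.
def step(s):
    n = len(s)
    return tuple(
        int(s[(i - 1) % n] + s[i] + s[(i + 1) % n] >= 2) for i in range(n)
    )

# Uniform prior over initial conditions; keeping only trajectories that
# obey the update rule is the "condition on the laws of physics" step,
# applied to the giant lookup table over all cell states at all times.
trajectories = [
    (init, step(init), step(step(init)))
    for init in itertools.product([0, 1], repeat=5)
]

# Condition on an observation: cell 0 is on at time 2.
consistent = [tr for tr in trajectories if tr[2][0] == 1]

# Posterior over the initial state of cell 2, by brute-force counting.
p = sum(tr[0][2] for tr in consistent) / len(consistent)
print("p(cell2(t=0)=1 | cell0(t=2)=1) =", p)
```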
It depends on granularity. If you are talking about your Game of Life world on the level of the rules of the game, that is equivalent to talking about our Universe on the level of the universal wave function. In both cases there are no more agents with actuators and no more do(.), as a result. That is, it’s not that your factorization will be causal, it’s that there is no causality.
But if you are taking a more granular view of your Game of Life world, similar to the macroscopic view of our Universe, where there are agents that can push and prod their environment, then suddenly talking about do(.) becomes useful for getting things done (just like it is useful to talk about addition or derivatives). On this macroscopic level, there is causality, but then your statement about all factorizations being causal is false (due to obvious examples involving reversing causal chains, for example).
On second thought, the main problem may not be lack of clarity but that your ideas about causality are too speculative and people either lack confidence that your research program (try to reduce Pearl’s do()-based causality to lower-level “causality in physics”) is the right one, or do not see how to proceed.
Both apply for me but the former is perhaps more relevant at this point. Basically I’m not sure that “do()-based causality” will actually end up playing a role in the ultimate “correct” decision theory (I guess if there is lack of clarity, it’s why you think that it will), and in the meantime there are other problems that definitely need to be solved and also seem more approachable.
(To explain why I think “do()-based causality” may not end up playing a role, it seems plausible that in an AI or at least decision theory (I wanted to say theoretical decision theory but that seems redundant :), cognition about “high-level causality” just ends up being handled as a special case by a more general algorithm, similar to how an AI programmed to maximize expected utility wouldn’t specifically need to be hand-coded with natural language processing if it was running on a sufficiently powerful computer.)
ETA: BTW, can you comment on whether my understanding in this comment was correct, and whether it still applies to Eliezer_2012?
You realize I’m arguing against do()-based causality? If not, I was very much unclearer than I thought.
I have never tried to reduce causal arrows to similarity; Barbour does, I don’t. I take causality to be, or be the epistemic conjugate of, something physical and real which was involved in manufacturing this oddly-well-modeled-by-causality universe that we actually live in. They are presently primitive in my model; I have not yet reduced them, except in the obvious sense that they are also formal mathematical relations between points, i.e., causal relations are a special case of logical relations (and yet we still live in a causal universe rather than a merely logical one). I do indeed reduce consciousness to computation and computation to causality, though there’s a step here involving magical reality-fluid about which I am still confused—I have no idea why or what it means for a causal process to be more or less real, either as a result of having more or less Born measure, being instantiated in many places, or for any other reason.
You realize I’m arguing against do()-based causality? If not, I was very much unclearer than I thought.
Maybe it’s just me not updating fast enough. My impression is that when you talked about causality prior to today, you usually mentioned Pearl and never said you disagreed with him on anything, so I assumed you wanted to keep his do()-based causality and just add a layer below it. Were you always against do()-based causality or did you change your mind at some point?
I have never tried to reduce causal arrows to similarity; Barbour does, I don’t.
Hmm, re-reading Timeless Causality, I don’t see how I could have learned that the idea belongs to Barbour and that you disagree with him. It sure sounds like it was your idea.
causal relations are a special case of logical relations (and yet we still live in a causal universe rather than a merely logical one)
Why should we care about causality as decision theorists, if we have decision theories that can deal with logical universes in general, and causal relations are just a special case of logical relations?
Hmm, re-reading Timeless Causality, I don’t see how I could have learned that the idea belongs to Barbour and that you disagree with him. It sure sounds like it was your idea.
This sounds like a high-priority problem, but actually I don’t see any reference to reduction-to-similarity in Timeless Causality, although there’s a lot in Barbour’s book about it. What do you mean by “mind reduces to computation which reduces to causal arrows which reduces to some sort of similarity relationship between configurations”? Unless this is just in the sense that causal mechanisms are logical relations?
I interpreted this paragraph as suggesting that causality reduces to similarity, but given your latest clarifications, I guess what you actually had in mind was that causality tends to produce similarity and so we can infer causality from similarity.
When two regions of spacetime are timelike separated, we cannot deduce any direction of causality from similarities between them; they could be similar because one is cause and one is effect, or vice versa. But when two regions of spacetime are spacelike separated, and far enough apart that they have no common causal ancestry assuming one direction of physical causality, but would have common causal ancestry assuming a different direction of physical causality, then similarity between them… is at least highly suggestive.
Previously, I thought you considered causality to be a higher level concept rather than a primitive one, similar to “sound waves” or “speech” as opposed to say “particle movements”. That sort of made sense except that I didn’t know why you wanted to make causality an integral part of decision theory. Now you’re saying that you consider causality to be primitive and a special kind of logical relations, which actually makes less sense to me, and still doesn’t explain why you want to make causality an integral part of decision theory. It makes less sense because if we consider the laws of physics as logical relations, they don’t have a direction. As you said, “Time-symmetrical laws of physics didn’t seem to leave room for asymmetrical causality.” I don’t see how you get around this problem if you take causality to be primitive. But the bigger problem is that (at the risk of repeating myself too many times) I don’t understand your motivation for studying causality, because if I did I’d probably spend more time thinking about it myself and understand your ideas about it better.
I’m trying to think like reality. If causality isn’t a special kind of logic, why is everything in the known universe made out of (a continuous analogue of) causality instead of logic in general? Why not Time-Turners or a zillion other possibilities?
If causality isn’t a special kind of logic, why is everything in the known universe made out of (a continuous analogue of) causality instead of logic in general?
Wait, if causality is a special kind of logic, how does that help answer the question? Don’t we still have to answer why the universe is made of this kind of logic instead of some other?
Why not Time-Turners or a zillion other possibilities?
I don’t understand how lack of Time-Turners makes you think causality is a special kind of logic or why you want to incorporate causality into decision theory (which is still my bigger question). Similar questions could be asked about other features of the universe:
Why does the universe have 3 spatial dimensions instead of a zillion other possibilities?
Why don’t the laws of physics allow information to be destroyed (i.e., never map two different states at time t to the same state at time t+1)?
But we’re not concerned about these questions at the level of decision theory, since it seems possible to have a decision theory that works with an arbitrary number of dimensions, and with both kinds of laws of physics. Similarly, I don’t see why we can’t have a “causality-agnostic” decision theory that works in universes both with and without Time-Turners.
I think the point was more about whether causality should be thought of as a fundamental part of the rules, like this, or whether it’s more useful to think of causality as an abstraction that (ahem, excuse the term) “emerges” from the fundamentals when we try to identify patterns in said fundamentals.
Somewhat akin to how “meaning” exists in a computer program despite none of the bits fundamentally meaning anything, I think. My thoughts are becoming more and more confused as I type, though, which makes me wish I had an environment suitable to better concentration.
You realize I’m arguing against do()-based causality?
Ok, I would like to state for the record that I no longer understand what you mean when you say “factor something as a causal graph” (which may well mean no one else on this site understands either). Basically everything you ever wrote on the subject of causality or causal graphs (other than exposition of standard material) is now a complete mystery to me. In particular, I don’t understand what sorts of graphs are in your paper on Newcomb’s problem, or why those graphs justify any conclusions about Newcomb’s problem.
Graph models are overloaded, there are lots of different models that all have the same graph. You have to explain what you mean if you use graphs.
I would be interested in reading about this. A few points:
(a) I agree that causality is a “useful fiction” (like real numbers or derivatives).
(b) If you are going to be writing posts about “causal diagrams” you need to be clear about what you mean. Usually by causal diagrams people mean Pearl’s stuff, or closely related stuff (agnostic causal models, minimal causal models, etc.) All these models are defined via either do(.) or stronger notation. If you do not mean that by causal diagrams, that’s fine! But please explain what you do mean to avoid confusing people. You have a paper on TDT that seems to use causal diagrams. Which ones did you mean in there?
edit: I should say that if your project has “defining actual cause” as a special case, it’s probably a black hole from which no one returns (it’s the analytic philosophy version of the P/NP problem).
edit 2: I think the derivation of “do(.)” ought to be not dissimilar to the derivation of “+”, if you worry about induction problems. “+” is a mathematical fiction very useful for representing regularities involved in handling objects; “do(.)” is a mathematical fiction very useful for representing regularities involved in algorithms with actuators running around.
If causality is a useful fiction, it’s conjugate to some useful nonfiction; I should like to know what the latter is.
I don’t think Pearl’s diagrams are defined via do(). I think I disagree with that statement even if you can find Pearl making it. Even if do(), as shorthand for describing experimental procedures involving switches on arrows, does happen to be a procedure you can perform on those diagrams, that’s a consequence of the definition; it is not actually part of the representation of the actual causal model. You can write out causal models, and they give predictions—this suffices to define them as hypotheses.
More importantly: How can you possibly make the truth-condition be a correspondence to counterfactual universes that don’t actually exist? That’s the point of my whole epistemology sequence—truth-conditions get defined relative to some combination of physical reality that actually exists, and valid logical consequences pinned down by axioms. So yes, I would definitely derive do() rather than have it being primitive, and I wouldn’t ever talk about the truth-condition of causal models relative to a do() out there in the environment—we talk about the truth-condition of causal models relative to quarks and electrons and quantum fields, to reality.
I’m a bit worried (from some of his comments about causal decision theory) that Pearl may actually believe in free will, or did when he wrote the first edition of Causality. In reality nothing is without parents, nothing is physically uncaused—that’s the other problem with do().
I don’t think Pearl’s diagrams are defined via do(). I think I disagree with that statement even if you can find Pearl making it.
Well, the author is dead, they say.
There are actually two separate causal models in Pearl’s book: “causal Bayesian networks” (chapter 1), and “functional models” aka “non-parametric structural equation models” (chapter 7). These models are not the same; in fact, functional models are a lot stronger logically (that is, they make many more assumptions).
The first is defined via do(.), you can check the definition. The second can be defined either via a set of functions, or via a set of axioms. The two definitions are, I believe, equivalent. The axiomatic approach is valuable in statistics, where we often cannot exhibit the functions that make up the model, and must resort to enumerating assumptions. If you want to take the axiomatic approach you need a language stronger than do(.). In particular you need to be able to express counterfactual statements of the form “I have a headache. Would I have a headache had I taken an aspirin one hour ago?” Pearl’s model in chapter 7 actually makes assumptions about counterfactuals like that. If you think talking about counterfactual worlds that don’t actually exist is dubious, then you join a large chorus of folks who are critical of Pearl’s functional models.
If you want to learn more about different kinds of causal models people look at, and the criticisms of models that make assumptions on counterfactuals, the following is a good read:
Some folks claim that a model is not causal unless it assumes consistency, which is an axiom stating that if for a person u, we intervene on X and set it to a value x that naturally occurs in u, then for any Y in u, the value of Y given that intervention is equal to the value of Y in that same person had we not intervened on X at all. Or, concisely:
Y(x,u) = Y(u), if X(u) = x
or even more concisely:
Y(X) = Y
This assumption is actually counterfactual. Without this assumption it’s not possible to do causal inference.
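In a functional model, consistency holds by construction, because the same structural function is reused whether or not we intervene; a minimal sketch (toy functions, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy functional model over a unit u = (u1, u2):
#   X(u) = f_X(u1),   Y(x, u) = f_Y(x, u2).
def f_X(u1):
    return int(u1 > 0.5)

def f_Y(x, u2):
    return int(u2 + 0.3 * x > 0.6)

def simulate(u, do_x=None):
    """Run the structural equations for unit u, optionally under do(X=x)."""
    u1, u2 = u
    x = f_X(u1) if do_x is None else do_x   # do() overrides X's mechanism
    return x, f_Y(x, u2)

for u in rng.uniform(size=(1_000, 2)):
    x_nat, y_nat = simulate(u)              # the factual world for unit u
    _, y_cf = simulate(u, do_x=x_nat)       # set X to the value it already had
    assert y_cf == y_nat                    # Y(x, u) = Y(u) whenever X(u) = x
print("consistency holds on every simulated unit")
```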
Reading this whole thread, I’m interested to know what your thoughts on causality are. Do you have existing posts on the subject that I should re-read? I was under the impression you pretty much agreed with Pearl, but now that seems not to be the case.
By the way, Pearl certainly wasn’t arguing from a “free will” perspective—rather, I think he’d agree with “there is no do() in physics” but disagree that “there is causality in physics”.
a → b → c → d, with a ← u1 → c, and a ← u2 → d, where we do not observe u1,u2, and u1,u2 are very complicated, then we can figure out the true graph exactly by independence type techniques without having to observe u1 and u2. Note: the marginal distribution p(a,b,c,d) that came from this graph has no conditional independences at all (checkable by d-separation on a,b,c,d), so typical techniques fail.
Luke gave me the impression SI wasn’t very interested in that point
How? I find myself very interested in this point, just not enough to schedule a lecture about it in the next month, since we have a lot of other things going on, and we’re out of town, and so on.
On your account, how do you learn causal models from observing someone else perform an experiment? That doesn’t involve any interventions or counterfactuals. You only see what actually happens, in a system that includes a scientist.
That depends what you mean by an “experiment.” If you divide a set of patients into a control group and a test group, and then have the test group smoke a pack of cigarettes per day, that is an “experiment” to me, one that is represented by an intervention (because we are forcing the test group to smoke regardless of what they would naturally want to do).
Observing that the test group is much more likely to develop cancer would lead me to conclude that the graph
smoking → cancer
is a causal graph rather than merely a statistical graph.
If we do not perform the above experiment due to ethical reasons, but instead use observational data on smokers, we have to worry about confounders, like Fisher did. We also have to worry, because we are implicitly linking that data with counterfactual situations (what would have happened if those guys we observed were forced to smoke). This linking isn’t “free,” there are assumptions operating in the background. Assumptions expressed in a language that can talk about counterfactual situations.
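A toy simulation of this point (hypothetical numbers; the confounder u is deliberately left out of the analysis): conditioning on observed smoking inflates the apparent effect, while randomizing via do(smokes) recovers the true one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical data-generating process with an unobserved confounder u
# that raises both the chance of smoking and the chance of cancer.
u = rng.uniform(size=n)

def smokes_naturally():
    return (rng.uniform(size=n) < 0.2 + 0.6 * u).astype(int)

def cancer(smokes):
    # True causal effect of smoking on cancer risk: +0.10.
    return (rng.uniform(size=n) < 0.05 + 0.10 * smokes + 0.20 * u).astype(int)

# Observational regime: people choose whether to smoke.
s_obs = smokes_naturally()
c_obs = cancer(s_obs)
obs_diff = c_obs[s_obs == 1].mean() - c_obs[s_obs == 0].mean()

# Experimental regime: do(smokes) assigned by coin flip, severing the
# u -> smokes arrow while leaving the cancer mechanism untouched.
s_rct = rng.integers(0, 2, size=n)
c_rct = cancer(s_rct)
rct_diff = c_rct[s_rct == 1].mean() - c_rct[s_rct == 0].mean()

print("observational difference:", round(obs_diff, 3))  # ~0.14, inflated by u
print("experimental difference: ", round(rct_diff, 3))  # ~0.10, the true effect
```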
(a) You don’t need to observe confounders to learn structure from data. In fact, sometimes you don’t need any standard conditional independence at all. (Luke gave me the impression SI wasn’t very interested in that point—maybe it should be).
(b) Occam’s razor / faithfulness gives you enough to learn the structure of statistical models, not causal ones. You need additional assumptions to equate the statistical models you learn with causal models. Bayesian networks are not causal models. Causality is not about conditional independence, it is about counterfactual invariance, that is causality expresses what changes or stays the same after a hypothetical ‘wiggle.’
There is no guarantee that even given Occam’s razor and faithfulness being true that the graph you obtain is such that if I wiggle a parent, the child will change. To verify your causal assumptions, you have to run an experiment, or no scientist will believe your graph is causal. This is what real causal discovery papers do, for example:
http://www.sciencemag.org/content/308/5721/523.abstract
Here they learned a protein signaling network, but then implemented an experiment where they changed the protein level of a parent via an RNA molecule, and verified that the child changed, but parent of a parent did not change.
I am sure you can set up a Bayesian story for this entire enterprise, if you wanted. But, firstly, this Bayesian story would not be expressed purely in probability theory but in the language that can express counterfactual invariance and talk about experiments (for example language of potential outcomes or do(.)). And secondly, giving something a Bayesian story is sort of equivalent to re-expressing some complicated program as a vi macro. Could be done (vi is turing-complete!) but why? People don’t write practical code in vi macros.
This sounds like we’re talking past each other somehow. Your point (a) is not clear to me—I was saying that to learn a sufficiently high-probability causal model from non-intervention data, you need to have observed the data in sufficient detail to rule out confounders (except at some low probability) (via simplicity priors, which otherwise can’t drive down the probability of an untestable invisible confounder by all that far). This can certainly be done in principle, e.g. if you put the system under a microscope with a higher resolution than the system, and verified there were only X kinds of stuff in it and no others.
Your point (b) sounds just plain wrong to me. If you have a simplicity prior over causal models, and you can derive testable probable predictions from causal models, then you can do Bayesian updating and get a posterior over causal models. Substituting the word “flammable fizzbins” for “causal models” in the preceding sentence will produce another true sentence. I think you mean something different by “Bayesian” and “Occam’s Razor” than I do.
By (a) I mean that you can sometimes get the true graph exactly even without having to observe confounders. Actually this was sort of known already (see the FCI algorithm, or even the IC* algorithm in Pearl’s book), but we can do a lot better than that. For example, if we have the true graph:
a → b → c → d, with a ← u1 → c, and a ← u2 → d, where we do not observe u1,u2, and u1,u2 are very complicated, then we can figure out the true graph exactly by independence type techniques without having to observe u1 and u2. Note: the marginal distribution p(a,b,c,d) that came from this graph has no conditional independences at all (checkable by d-separation on a,b,c,d), so typical techniques fail.
(b) is I guess “a subtle issue”—but my point is about careful language use and keeping causal and statistical issues clear and separate.
A “Bayesian network” (or “belief network”—I don’t like the word Bayesian here because it is confusing the issue, you can use frequentist techniques with belief networks if you wanted, in fact a lot of folks do) is a joint distribution that factorizes as a DAG. That’s it. Nothing about causality. If there is a joint density representing a causal process where a is a direct cause of b is a direct cause of c, then this joint density will factorize with respect to both
a → b → c
and
a ← b ← c
but only the former graph is causal, the latter is not. Both graphs form a “Bayesian network” with the joint density (since the density factorizes with respect to both graphs), but only one graph is a causal graph. If you want to talk about causal models, in addition to saying that there is a Markov factorization you also need to say something else—something that makes parents into direct causes. Usually people say something like:
for every x, p(x | pa(x)) = p(x | do(pa(x))), or mention the g-formula, or the truncated factorization of do(.), or “the causal Markov condition.”
But this is something that (a) you need to say explicitly, and (b) involves language beyond standard probability theory because there is a do(.), and (c) is controversial to some people. What is do(.)? It refers to a hypothetical experiment/intervention.
If all you are learning is a graph that gives you a Markov factorization you have no business making claims about interventions—interventions are a separate magisterium. You can assume that the unknown graph from which the data came is causal—but you need to say this explicitly, this assumption will be controversial to some people, and by making that assumption you are I think committing yourself to the use of interventionist/potential outcome language (just to describe what it means for a data generating graph to be causal).
I have no problems with you doing Bayesian updating and getting posteriors over causal models—I just wanted to get more precision on what a causal model is. A causal model is not a density factorizing with respect to a DAG—that’s a statistical model. A causal model makes assertions that relate hypothetical experiments like p(x | do(pa(x))) with observed data like p(x | pa(x)). So your Bayesian updating is operating in a world that contains more than just probability theory (which is a theory of standard joint densities, without the mention of do(.) or hypothetical experiments). You can in fact augment probability theory with a logical description of interventions, see for example this paper:
http://www.jair.org/papers/paper648.html
If your notion of causal model does not relate do(.) to observed data, then I don’t know what you mean by a causal model. It’s certainly not what I mean by it.
Well, this is very rapidly getting us into complex territory that future decision-theory posts will hopefully explore, but a very brief answer would be that I am unwilling to define anything fundamental in terms of do() operations because our universe does not contain any do() operations, and counterfactuals are not allowed to be part of our fundamental ontology because nothing counterfactual actually exists and no counterfactual universes are ever observed. There are quarks and electrons, or rather amplitude distributions over joint quark and lepton fields; but there is no do() in physics.
Causality seems to exist, in the sense that the universe seems completely causally structured—there is causality in physics. On a microscopic level where no “experiments” ever take place and there are no uncertainties, the microfuture is still related to the micropast with a neighborhood-structure whose laws would yield a continuous analogue of D-separation if we became uncertain of any variables.
Counterfactuals are human hypothetical constructs built on top of high-level models of this actually-existing causality. Experiments do not perform actual interventions and access alternate counterfactual universes hanging alongside our own, they just connect hopefully-Markov random numbers into a particular causal arrow.
Another way of saying this is that a high-level causal model is more powerful than a high-level statistical model because it can induct and describe switches, as causal processes, which behave as though switching arrows around, and yields predictions for this new case even when the settings of the switches haven’t been observed before. This is a fancypants way of saying that a causal model lets you throw a bunch of rocks at trees, and then predict what happens when you throw rocks at a window for the first time.
As an additional data point, I also still do not have a very good understanding of your ideas about causality (although I did note earlier that it seems rather different from Pearl’s (which are similar to Ilya’s)). I also note that nobody else seems to have a good understanding of your ideas, at least not enough to try to build upon them either here on LW or on the decision theory mailing list or try to explain them to me when I asked.
Interesting. Sorry to bother you further, but can I ask you to quote a particular sentence or paragraph above that seems unclear? Or was the above clear, but it implies other questions that aren’t clear, or the motivations aren’t clear?
As a third data point, I used to be very confused about your ideas about causality, but your recent writing has helped a lot. To make embarassingly clear how very wrong I’ve been able to be, some years ago when you’d told us about TDT but not given details, I thought you had a fully worked-out and justified theory about how a decision agent could use causal graphs to model its uncertainty about the output of platonic computations, and use do() on its own output to compute the utility of different courses of action, and I got very frustrated when I simply couldn’t figure out how to fill in the details of that...
...hmm. (I should probably clarify: when I say “use causal graphs to reason about”, I don’t mean in the ‘trivial’ sense you are actually using where the platonic computations cause other things but are themselves uncaused in the model; I mean some sort of system where different computations and/or logical facts about computations form a non-degenerate graph, and where do() severs one node somewhere in the middle of that graph from its parents.) “And”, I was going to say, “when you finally did tell us more, I had a strong oh moment when you said that you still weren’t able to give a completely satisfying theory/justification, but were reasonably satisfied with the version you had. But I still continued to think that my picture of what you had been trying to do had been correct, only you didn’t have a fully worked-out theory of it, either.” The actual quote that turned into this memory of things seems to be,
But there’s also this:
And later:
Huh. In retrospect I can see how this matches my current understanding of what you’re doing, but comparing this to what I wrote in the first paragraph above (before searching for that post), it’s actually surprisingly nonobvious where the difference is between what you wrote back then and what I wrote just now to explain the way in which I had horribly misunderstood you...
Anyway. As for what you wrote in the great-grandparent, I had to read it slowly, but most of it makes perfect sense to me; the last paragraph I’m not quite as sure about, but there too I think I understand what you mean.
There is, however, one major point on which I currently feel confused. You seem to be saying that causal reasoning should be seen as a very fundamental principle of epistemology, and on your list of open problems, you have “Better formalize hybrid of causal and mathematical inference.” But it seems to me that if you just do inference about logical uncertainty, and the mathematical object you happen to be interested in is a cellular automaton or the PDE giving the time evolution of some field theory, then your probability distribution over the state at different times will necessarily happen to factor in such a way that it can be represented as a causal model. So why treat causality as something fundamental in your epistemology, and then require deep thinking about how to integrate it with the rest of your reasoning system, rather than treating it as an efficient way to compress some probability distributions, which then just automatically happens to apply to the mathematical objects representing our actual physics? (At this point, I ask this question not as a criticism, but simply to illustrate my current confusion.)
Because causality is not about efficiently encoding anything. A causal process a → b → c is equally efficiently encoded via c → b → a.
This is not true, for lots of reasons, one of them having to do with “observational equivalence.” A given causal graph has many different graphs with which it agrees on all observable constraints. All these other graphs are not causal. The 3 node chain above is one example.
Sorry, I understand the technical point about causal graphs you are refering to, but I do not understand the argument you’re trying to make with it in this context.
Suppose it’s the year 2100, and we have figured out the true underlying laws of physics, and it turns out that we run on a cellular automaton, and we have some very large and energy-intensive instruments that allow us to set up experiments where we can precisely set up the states of individual primitive cells. Now we want to use probabilistic reasoning to examine the time evolution of a cluster of such cells if we have only probabilistic information about the boundary conditions. Since this is a completely ordinary cellular automaton, we can describe it using a causal model, where the state of a cell at time t+1 is caused by its own state and the state of its neighbours at time t.
In this case, causality is really fundamentally there in the laws of physics (in a discrete analog of what we suspect for our actual laws of physics). And though you can’t reach in from the outside of the universe, it’s possible to imagine scenarios where you could do the equivalent of do() on some of the cells in your experiment, though it wouldn’t really be done by acausally changing what happens in the universe—one way to imagine it is that your experiment runs only in a two-dimensional slice surrounded by a “vacuum” of cells in a “zero” state, and you can reach in through that vacuum to change one of the cells in the two-dimensional grid.
But when it comes to how to model this inside a computer, it seems that you can reach all the conclusions you need by “ordinary” probabilistic reasoning: For example, you could start with say a uniform joint probability distribution over the state of all cells in your experiment at all times; then you condition on the fact that they fulfill the laws of physics, i.e. the time evolution rule of the cellular automaton; then you condition again on what you know about the boundary conditions, e.g. the fact that your experimental apparatus reaches in through the third dimension at some point to change the state of some cells. It’s extraordinarily inefficient to represent the joint distribution as a giant look-up table of probabilities, but I do not see what inferences you want but are going to lose by doing the calculations that way.
(All of this holds even if the true laws happen to be deterministic in only one direction in time, so that in your experiment you can distinguish a → b → c from c → b → a by reaching in through the third dimension at time b.)
It depends on granularity. If you are talking about your game of life world on the level of the rules of the game, that is equivalent to talking about our Universe on the level of the universal wave function. In both cases there are no more agents with actuators and no more do(.), as a result. That is, it’s not that your factorization will be causal, it’s that there is no causality.
But if you are taking a more granular view of your game of life world, similar to the macroscopic view of our Universe, where there are agents that can push and prod their environment, then suddenly talking about do(.) becomes useful for getting things done (just like it is useful to talk about addition or derivatives). On this macroscopic level, there is causality, but then your statement about all factorizations being causal is false (due to obvious examples involving reversing causal chains, for example).
On second thought, the main problem may not be lack of clarity but that your ideas about causality are too speculative and people either lack confidence that your research program (try to reduce Pearl’s do()-based causality to lower-level “causality in physics”) is the right one, or do not see how to proceed.
Both apply for me but the former is perhaps more relevant at this point. Basically I’m not sure that “do()-based causality” will actually end up playing a role in the ultimate “correct” decision theory (I guess if there is lack of clarity, it’s why you think that it will), and in the mean time there are other problems that definitely need to be solved and also seem more approachable.
(To explain why I think “do()-based causality” may not end up playing a role, it seems plausible that in an AI or at least decision theory (I wanted to say theoretical decision theory but that seems redundant :), cognition about “high-level causality” just ends up being handled as a special case by a more general algorithm, similar to how an AI programmed to maximize expected utility wouldn’t specifically need to be hand-coded with natural language processing if it was running on a sufficiently powerful computer.)
ETA: BTW, can you comment on whether my understanding in this comment was correct, and whether they still apply to Eliezer_2012?
You realize I’m arguing against do()-based causality? If not, I was very much unclearer than I thought.
I have never tried to reduce causal arrows to similarity; Barbour does, I don’t. I take causality to be, or be the epistemic conjugate of, something physical and real which was involved in manufacturing this oddly-well-modeled-by-causality universe that we actually live in. They are presently primitive in my model; I have not yet reduced them, except in the obvious sense that they are also formal mathematical relations between points, i.e., causal relations are a special case of logical relations (and yet we still live in a causal universe rather than a merely logical one). I do indeed reduce consciousness to computation and computation to causality, though there’s a step here involving magical reality-fluid about which I am still confused—I have no idea why or what it means for a causal process to be more or less real, either as a result of having more or less Born measure, being instantiated in many places, or for any other reason.
Maybe it’s just me not updating fast enough. My impression is that when you talked about causality prior to today, you usually mentioned Pearl and never said you disagreed with him on anything, so I assumed you wanted to keep his do()-based causality and just add a layer below it. Were you always against do()-based causality or did you change your mind at some point?
Hmm, re-reading Timeless Causality, I don’t see how I could have learned that the idea belongs to Barbour and that you disagree with him. It sure sounds like it was your idea.
Why should we care about causality as decision theorists, if we have decision theories that can deal with logical universes in general, and causal relations are just a special case of logical relations?
This sounds like a high-priority problem, but actually I don’t see any reference to reduction-to-similarity in Timeless Causality, although there’s a lot in Barbour’s book about it. What do you mean by “mind reduces to computation which reduces to causal arrows which reduces to some sort of similarity relationship between configurations”? Unless this is just in the sense that causal mechanisms are logical relations?
I interpreted this paragraph as sugesting that causality reduces to similarity, but given your latest clarifications, I guess what you actually had in mind was that causality tends to produce similarity and so we can infer causality from similarity.
Previously, I thought you considered causality to be a higher level concept rather than a primitive one, similar to “sound waves” or “speech” as opposed to say “particle movements”. That sort of made sense except that I didn’t know why you wanted to make causality an integral part of decision theory. Now you’re saying that you consider causality to be primitive and a special kind of logical relations, which actually makes less sense to me, and still doesn’t explain why you want to make causality an integral part of decision theory. It makes less sense because if we consider the laws of physics as logical relations, they don’t have a direction. As you said, “Time-symmetrical laws of physics didn’t seem to leave room for asymmetrical causality.” I don’t see how you get around this problem if you take causality to be primitive. But the bigger problem is that (at the risk of repeating myself too many times) I don’t understand your motivation for studying causality, because if I did I’d probably spend more time thinking about it mysef and understand your ideas about it better.
I’m trying to think like reality. If causality isn’t a special kind of logic, why is everything in the known universe made out of (a continuous analogue of) causality instead of logic in general? Why not Time-Turners or a zillion other possibilities?
Wait, if causality is a special kind of logic, how does that help answer the question? Don’t we still have to answer why the universe is made of this kind of logical instead of some other?
I don’t understand how lack of Time-Turners makes you think causality is a special kind of logic or why you want to incorporate causality into decision theory (which is still my bigger question). Similar questions could be asked about other features of the universe:
Why does the universe have 3 spatial dimensions instead of a zillion other possibilities?
Why doesn’t the laws of physics allow information to be destroyed (i.e., never maps 2 different states at time t to the same state at time t+1)?
But we’re not concerned about these questions at the level of decision theory, since it seems possible to have a decision theory that works with an arbitrary number of dimensions, and with both kinds of laws of physics. Similarly, I don’t see why we can’t have a “causality-agnostic” decision theory that works in universes both with and without Time-Turners.
I think the point was more about whether causality should be thought of as a fundamental part of the rules, like this, or whether it’s more useful to think of causality as an abstraction that (ahem, excuse the term) “emerges” from the fundamentals when we try to identify patterns in said fundamentals.
Somewhat akin to how “meaning” exists in a computer program despite none of the bits fundamentally meaning anything, I think. My thoughts are becoming more and more confused as I type, though, which makes me wish I had an environment suitable to better concentration.
Ok, I would like to state for the record that I no longer understand what you mean when you say “factor something as a causal graph” (which may well mean no one else on this site understands either). Basically everything you ever wrote on the subject of causality or causal graphs (other than exposition of standard material) is now a complete mystery to me. In particular, I don’t understand what sorts of graphs are in your paper on Newcomb’s problem, or why those graphs license any conclusions about Newcomb’s problem.
Graph models are overloaded; there are lots of different models that all share the same graph. You have to explain what you mean if you use graphs.
I would be interested in reading about this. A few points:
(a) I agree that causality is a “useful fiction” (like real numbers or derivatives).
(b) If you are going to be writing posts about “causal diagrams,” you need to be clear about what you mean. Usually by causal diagrams people mean Pearl’s models or closely related ones (agnostic causal models, minimal causal models, etc.). All of these are defined via either do(.) or stronger notation. If you do not mean that by causal diagrams, that’s fine! But please explain what you do mean, to avoid confusing people. You have a paper on TDT that seems to use causal diagrams. Which ones did you mean there?
edit: I should say that if your project has “defining actual cause” as a special case, it’s probably a black hole from which no one returns (it’s the analytic philosophy version of the P/NP problem).
edit 2: I think the derivation of “do(.)” ought to be not dissimilar to the derivation of “+”, if you worry about induction problems. “+” is a mathematical fiction very useful for representing regularities involved in handling objects; “do(.)” is a mathematical fiction very useful for representing regularities involved in algorithms with actuators running around.
If causality is a useful fiction, it’s conjugate to some useful nonfiction; I should like to know what the latter is.
I don’t think Pearl’s diagrams are defined via do(). I think I disagree with that statement even if you can find Pearl making it. Even if do() (as shorthand for describing experimental procedures involving switches on arrows) does happen to be a procedure you can perform on those diagrams, that’s a consequence of the definition; it is not actually part of the representation of the actual causal model. You can write out causal models, and they give predictions; this suffices to define them as hypotheses.
More importantly: How can you possibly make the truth-condition be a correspondence to counterfactual universes that don’t actually exist? That’s the point of my whole epistemology sequence—truth-conditions get defined relative to some combination of physical reality that actually exists, and valid logical consequences pinned down by axioms. So yes, I would definitely derive do() rather than have it being primitive, and I wouldn’t ever talk about the truth-condition of causal models relative to a do() out there in the environment—we talk about the truth-condition of causal models relative to quarks and electrons and quantum fields, to reality.
I’m a bit worried (from some of his comments about causal decision theory) that Pearl may actually believe in free will, or did when he wrote the first edition of Causality. In reality nothing is without parents, nothing is physically uncaused—that’s the other problem with do().
Well, the author is dead, they say.
There are actually two separate causal models in Pearl’s book: “causal Bayesian networks” (chapter 1) and “functional models,” a.k.a. “non-parametric structural equation models” (chapter 7). These models are not the same; in fact, functional models are logically much stronger (that is, they make many more assumptions).
The first is defined via do(.); you can check the definition. The second can be defined either via a set of functions or via a set of axioms. The two definitions are, I believe, equivalent. The axiomatic approach is valuable in statistics, where we often cannot exhibit the functions that make up the model and must resort to enumerating assumptions. If you want to take the axiomatic approach, you need a language stronger than do(.). In particular, you need to be able to express counterfactual statements of the form “I have a headache. Would I have a headache had I taken an aspirin one hour ago?” Pearl’s model in chapter 7 actually makes assumptions about counterfactuals like that. If you think talking about counterfactual worlds that don’t actually exist is dubious, then you join a large chorus of folks who are critical of Pearl’s functional models.
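To make the chapter-1 vs. chapter-7 distinction concrete, here is a minimal Python sketch (my own toy example with invented numbers, not anything from Pearl’s book): a functional model is just a set of equations plus exogenous noise; do() then falls out as surgery on one equation, while the chapter-7 counterfactual additionally requires reusing the same noise across two hypothetical worlds.

```python
import random

def f_headache(aspirin, u_h):
    """Toy structural equation for headache; thresholds are invented."""
    return u_h < (0.2 if aspirin else 0.6)

def observe():
    """A draw from the observational distribution p(aspirin, headache)."""
    u_a, u_h = random.random(), random.random()
    aspirin = u_a < 0.3                      # the natural aspirin equation
    return aspirin, f_headache(aspirin, u_h)

def do_aspirin(value):
    """A draw under do(aspirin = value): the aspirin equation is replaced
    by a constant; the headache equation is untouched. Here do() is
    derived, not primitive: it is just surgery on one function."""
    u_h = random.random()
    return value, f_headache(value, u_h)

def counterfactual_pair():
    """The chapter-7-style object: "I have a headache; would I, had I
    taken aspirin?" needs BOTH outcomes for one unit, i.e. the SAME
    exogenous noise reused across the two hypothetical worlds."""
    u_h = random.random()
    return f_headache(False, u_h), f_headache(True, u_h)
```

The point is only that the functions-plus-noise object supports all three operations, while a model defined via do(.) alone pins down just the first two.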
If you want to learn more about different kinds of causal models people look at, and the criticisms of models that make assumptions on counterfactuals, the following is a good read:
http://events.iq.harvard.edu/events/sites/iq.harvard.edu.events/files/wp100.pdf
Some folks claim that a model is not causal unless it assumes consistency, an axiom stating that if, for a person u, we intervene on X and set it to the value x that naturally occurs in u, then for any Y in u, the value of Y under that intervention equals the value of Y in that same person had we not intervened on X at all. Or, concisely:
Y(x,u) = Y(u), if X(u) = x
or even more concisely:
Y(X) = Y
This assumption is itself counterfactual; without it, causal inference is not possible.
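Continuing the toy functional model sketched above (same invented numbers), consistency holds in such a model by construction, which a few lines can check mechanically:

```python
def f_headache(aspirin, u_h):
    """Toy structural equation; thresholds are invented."""
    return u_h < (0.2 if aspirin else 0.6)

u_a, u_h = 0.1, 0.5          # one fixed unit u, i.e. its exogenous noise
x_natural = u_a < 0.3        # X(u): this unit naturally takes aspirin
y_observed = f_headache(x_natural, u_h)  # Y(u)
y_forced = f_headache(True, u_h)         # Y(x, u) with x = X(u) = True
assert y_observed == y_forced            # Y(x, u) = Y(u) when X(u) = x
```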
Reading this whole thread, I’m interested to know what your thoughts on causality are. Do you have existing posts on the subject that I should re-read? I was under the impression you pretty much agreed with Pearl, but now that seems not to be the case.
By the way, Pearl certainly wasn’t arguing from a “free will” perspective—rather, I think he’d agree with “there is no do() in physics” but disagree that “there is causality in physics”.
Irrelevant question: Isn’t (b || d) | a, c?
No, because b → c ↔ a ↔ d is an open path if you condition on c and a (here x ↔ y marks a hidden common cause of x and y, and conditioning on the colliders c and a is what opens the path).
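For anyone who wants to check paths like this mechanically, here is a minimal sketch of the standard moral-ancestral-graph test for d-separation (Lauritzen et al.) in plain Python; h1 and h2 are my explicit stand-ins for the hidden common causes behind the two ↔ edges.

```python
from collections import defaultdict, deque
from itertools import combinations

def ancestors(edges, nodes):
    """All nodes with a directed path into `nodes`, including `nodes`."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()] - seen:
            seen.add(p)
            stack.append(p)
    return seen

def d_separated(edges, xs, ys, zs):
    """True iff xs _||_ ys | zs in the DAG given as an edge list."""
    keep = ancestors(edges, set(xs) | set(ys) | set(zs))
    parents, adj = defaultdict(set), defaultdict(set)
    for u, v in edges:                   # restrict to the ancestral graph
        if u in keep and v in keep:
            parents[v].add(u)
            adj[u].add(v)
            adj[v].add(u)
    for ps in parents.values():          # moralize: marry all co-parents
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    zs = set(zs)                         # delete zs, test reachability
    seen = set(xs) - zs
    frontier = deque(seen)
    while frontier:
        n = frontier.popleft()
        if n in ys:
            return False
        for m in adj[n] - zs - seen:
            seen.add(m)
            frontier.append(m)
    return True

# The path in question, with the hidden causes written out explicitly:
edges = [("b", "c"), ("h1", "c"), ("h1", "a"), ("h2", "a"), ("h2", "d")]
print(d_separated(edges, {"b"}, {"d"}, {"a", "c"}))  # False: path is open
print(d_separated(edges, {"b"}, {"d"}, set()))       # True: colliders block it
```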
Ah, right.
How? I find myself very interested in this point, just not enough to schedule a lecture about it in the next month, since we have a lot of other things going on, and we’re out of town, and so on.
Fair enough, retracted. Sorry!
On your account, how do you learn causal models from observing someone else perform an experiment? That doesn’t involve any interventions or counterfactuals. You only see what actually happens, in a system that includes a scientist.
That depends what you mean by an “experiment.” If you divide a set of patients into a control group and a test group, and then have the test group smoke a pack of cigarettes per day, that is an “experiment” to me, one that is represented by an intervention (because we are forcing the test group to smoke regardless of what they would naturally want to do).
Observing that the test group is much more likely to develop cancer would lead me to conclude that the graph
smoking → cancer
is a causal graph rather than merely a statistical graph.
If we do not perform the above experiment due to ethical reasons, but instead use observational data on smokers, we have to worry about confounders, like Fisher did. We also have to worry because we are implicitly linking that data with counterfactual situations (what would have happened had the people we observed been forced to smoke). This linking isn’t “free”; there are assumptions operating in the background, assumptions expressed in a language that can talk about counterfactual situations.
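A quick simulation makes the worry concrete (all numbers invented): a hidden “stress” variable drives both smoking and cancer, and smoking itself does nothing, so the observational contrast is pure confounding while the experimental contrast is flat.

```python
import random

def observational():
    """Nature picks smoking: stressed people smoke more AND get cancer more."""
    stress = random.random() < 0.5
    smokes = random.random() < (0.8 if stress else 0.2)
    cancer = random.random() < (0.3 if stress else 0.05)
    return smokes, cancer

def experimental(forced):
    """We pick smoking by coin flip; cancer still depends only on stress."""
    stress = random.random() < 0.5
    cancer = random.random() < (0.3 if stress else 0.05)
    return forced, cancer

def rate(draws, smokes_value):
    """P(cancer | smoking status) estimated from a list of draws."""
    hits = [cancer for smokes, cancer in draws if smokes == smokes_value]
    return sum(hits) / len(hits)

obs = [observational() for _ in range(100_000)]
exp = [experimental(random.random() < 0.5) for _ in range(100_000)]
print(rate(obs, True), rate(obs, False))  # ~0.25 vs ~0.10: confounded gap
print(rate(exp, True), rate(exp, False))  # both ~0.175: no causal effect
```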