Causal inference using the algorithmic Markov condition (Janzing and Schölkopf, 2008) replaces conditional independences between random variables, which define the structure of causal graphs, with algorithmic conditional independences between bit strings.
Conditional probabilities between variables become conditional complexities between strings, i.e. K(x|y) is the length of the shortest program that can generate the string x from y. Similarly, algorithmic mutual information I(x:y) is the amount of information that can be omitted in defining a string y given a shortest compressor for string x, I(x:y) = K(y) - K(y|x*). K(x,y) is the complexity of the concatenation of two strings x and y. These lead naturally to a definition of algorithmic conditional independence as I(x:y|z) = K(x|z) + K(y|z) - K(x,y|z) = 0 , where equality is defined up to the standard additive constant.
Then a lot of sexy, confusing proofs happen. When the dust settles, it looks like if you take some strings describing observations, interpret them as nodes in a graph, and “factor” so that a certain algorithmic Markov condition holds (every node string should be algorithmically independent of its non-descendant node strings given the optimal compressor of its parents’ node strings), then every node can be computed by an O(1) program run on a Turing machine, with the node’s parents and a noise term as input (with each node’s noise string being jointly independent of the others).
Notably, this means that if we make two observations which were “generated from their parents by the same complex rule”, then we can “postulate another causal link between the nodes that explains the similarity of mechanisms”. They say “complex rule” because the mutual algorithmic information between simple information strings, like some digits of pi, will be swallowed up by additive constants. Which all seems very close to rediscovering TDT.
There’s more to the paper, but that’s the tasty bit, so the summary ends here.
Combining causality with algorithmic information theory
Warning: maths.
Causal inference using the algorithmic Markov condition (Janzing and Schölkopf, 2008) replaces conditional independences between random variables, which define the structure of causal graphs, with algorithmic conditional independences between bit strings.
Conditional probabilities between variables become conditional complexities between strings, i.e. K(x|y) is the length of the shortest program that can generate the string x from y. Similarly, algorithmic mutual information I(x:y) is the amount of information that can be omitted in defining a string y given a shortest compressor for string x, I(x:y) = K(y) - K(y|x*). K(x,y) is the complexity of the concatenation of two strings x and y. These lead naturally to a definition of algorithmic conditional independence as I(x:y|z) = K(x|z) + K(y|z) - K(x,y|z) = 0 , where equality is defined up to the standard additive constant.
Then a lot of sexy, confusing proofs happen. When the dust settles, it looks like if you take some strings describing observations, interpret them as nodes in a graph, and “factor” so that a certain algorithmic Markov condition holds (every node string should be algorithmically independent of its non-descendant node strings given the optimal compressor of its parents’ node strings), then every node can be computed by an O(1) program run on a Turing machine, with the node’s parents and a noise term as input (with each node’s noise string being jointly independent of the others).
Notably, this means that if we make two observations which were “generated from their parents by the same complex rule”, then we can “postulate another causal link between the nodes that explains the similarity of mechanisms”. They say “complex rule” because the mutual algorithmic information between simple information strings, like some digits of pi, will be swallowed up by additive constants. Which all seems very close to rediscovering TDT.
There’s more to the paper, but that’s the tasty bit, so the summary ends here.