Having studied the paradoxical results, I don’t think they are paradoxical for particularly interesting reasons. (Which is not to say that the paper is bad! I don’t expect I would have noticed these problems given just a definition of the Blackwell order! Just that I would recommend against taking this as progress towards “understanding knowledge”, and more like “an elaboration of how not to use Blackwell orders”.)
Proposition 2. There exist random variables S, X1, X2 and a function f:S→S′ from the support of S to a finite set S′ such that the following holds:
1) S and X1 are independent given f(S).
2) (X1←f(S))<(X2←f(S))
3) (X1←S)≰(X2←S)
This is supposed to be paradoxical because (3) implies that there is a task where it is better to look at X1, while (1) implies that any decision rule that uses X1 as computed through S can be translated into an equally good decision rule that uses X1 as computed through f(S). So presumably it should still be better to look at X1 than X2. However, (2) implies that once you compute through f(S), you’re better off using X2. Paradox!
This argument fails because (3) is a claim about utility functions (tasks) with type signature S×A→R, while (2) is a claim about utility functions (tasks) with type signature f(S)×A→R. It is possible to have a task that is expressible with S×A→R, where X1 is better (to make (3) true), but then you can’t express it with f(S)×A→R, and so (2) can be true but irrelevant. Their example follows precisely this format.
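To make the type-signature point concrete, here is a minimal sketch of the expected-utility computation that underlies the Blackwell order (the function name and array conventions are my own; channels are column-stochastic, matching the matrices below):

```python
import numpy as np

def channel_value(prior, channel, U):
    """Optimal expected utility from acting on the channel's output.

    prior:   shape (n_s,), the prior p(s) over states
    channel: shape (n_x, n_s), column-stochastic: channel[x, s] = P(X=x | S=s)
    U:       shape (n_s, n_a), U[s, a] = utility of action a in state s
    """
    joint = channel * prior          # joint[x, s] = P(X=x, S=s)
    scores = joint @ U               # scores[x, a] = sum_s P(x, s) * U(s, a)
    return scores.max(axis=1).sum()  # best action for each observation x
```

In this picture, a task with type signature f(S)×A→R is just a U whose rows agree across states with the same f-value; any U witnessing (3) must break that symmetry, which is exactly why (2) cannot speak to it.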
Proposition 3. There exist random variables S, X and there exists a function f:S→S′ from the support of S to a finite set S′ such that the following holds: (X←f(S))⋅(f(S)←S)≰X←S.
It seems to me that this is saying “there exists a utility function U and a function f, such that if we change the world to be Markovian according to f, then an agent can do better in the changed world than it could in the original world”. I don’t see why this is particularly interesting or relevant to a formalization of knowledge.
To elaborate, running the actual calculations from their counterexample (writing B←A for the column-stochastic matrix whose entry in row b, column a is P(B=b | A=a)), we get:
X←S = \begin{pmatrix} 1 & 1/3 & 0 \\ 0 & 2/3 & 1 \end{pmatrix}

f(S)←S = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

X←f(S) = \begin{pmatrix} 2/3 & 0 \\ 1/3 & 1 \end{pmatrix}

(X←f(S))⋅(f(S)←S) = \begin{pmatrix} 2/3 & 2/3 & 0 \\ 1/3 & 1/3 & 1 \end{pmatrix}
If you now compare (X←f(S))⋅(f(S)←S) to X←S, that is basically asking the question “let us consider these two ways in which the world could determine which observation I get, and see which of the ways I would prefer”. But (X←f(S))⋅(f(S)←S) is not the way that the world determines how you get the observation X -- that’s exactly given by the X←S table above (you can check consistency with the joint probability distribution in the paper). So I don’t really know why you’d do this comparison.
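As a sanity check, here is the comparison in numpy (the variable names are my own; each column of a matrix is the conditional distribution given that column's value):

```python
import numpy as np

# The conditional-probability matrices from the paper's counterexample.
X_given_S  = np.array([[1, 1/3, 0],
                       [0, 2/3, 1]])
fS_given_S = np.array([[1, 1, 0],
                       [0, 0, 1]])
X_given_fS = np.array([[2/3, 0],
                       [1/3, 1]])

composed = X_given_fS @ fS_given_S       # (X←f(S))⋅(f(S)←S)
print(composed)                          # [[2/3, 2/3, 0], [1/3, 1/3, 1]]
print(np.allclose(composed, X_given_S))  # False: the composition is not X←S
```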
The decomposition (X←f(S))⋅(f(S)←S) does make sense when X is independent of S given f(S) -- but in that case the resulting matrix would be exactly equal to X←S, which is as it should be when you are comparing two different views on how the world generates the same observation.
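A quick illustration of that claim, with an arbitrary prior and an arbitrary kernel of my own choosing: if the joint distribution really is Markovian (X drawn from f(S) alone), then conditioning on S recovers exactly the composed matrix.

```python
import numpy as np

prior      = np.array([0.2, 0.3, 0.5])   # arbitrary p(s) (my choice)
fS_given_S = np.array([[1, 1, 0],        # the same coarse-graining f as above
                       [0, 0, 1]])
X_given_fS = np.array([[2/3, 0],         # arbitrary column-stochastic kernel
                       [1/3, 1]])

# Joint P(X=x, S=s) when X is drawn from f(S) alone (Markov structure).
joint_XS = np.zeros((2, 3))
for s in range(3):
    for t in range(2):
        joint_XS[:, s] += prior[s] * fS_given_S[t, s] * X_given_fS[:, t]

X_given_S = joint_XS / joint_XS.sum(axis=0)  # condition on S, column by column
print(np.allclose(X_given_fS @ fS_given_S, X_given_S))  # True
```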
(Perhaps the interpretation under which this is paradoxical is one where B←A means that B is computed from A, and matrix multiplication corresponds to composition of computations, but that just seems like a clearly invalid interpretation, since in their example their probability distribution is totally inconsistent with “f(S) is computed from S, and then X is computed from f(S)”.)
More broadly, the framework in this paper provides a definition of “useful” information that allows all possible utility functions. However, it’s always possible to construct a utility function (given their setup) for which any given information is useful, and so the focus on decision-making isn’t going to help you distinguish between information that is and isn’t relevant / useful, in the sense that we normally mean those words.
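For example, take the task "guess the state exactly": under the paper's X←S, observing X already counts as "useful" by this standard (the uniform prior and the setup below are my own illustration):

```python
import numpy as np

prior = np.array([1/3, 1/3, 1/3])     # uniform prior over S (illustrative)
X_given_S = np.array([[1, 1/3, 0],
                      [0, 2/3, 1]])
U = np.eye(3)                         # U[s, a] = 1 iff guess a equals state s

def value(channel):
    # sum_x max_a sum_s p(s) P(x|s) U(s, a): best response per observation.
    return ((channel * prior) @ U).max(axis=1).sum()

print(value(np.ones((1, 3))))  # 1/3 -- no information: guess blindly
print(value(X_given_S))        # 2/3 -- observing X doubles the success rate
```

The point is just that "useful for some utility function" is a very weak bar.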
You could try to fix this by narrowing down the space of utility functions, or introducing limits on computation, or something else along those lines -- though then I would just apply those same sorts of fixes to traditional Shannon information.