Four things I’d change:

In the case of a neural net, I would probably say that the training data is X, and S is the thing we want to predict. Z measures (expected) accuracy of prediction, so to make good predictions with minimal info kept around from the data, we need a model. (Other applications of the theorem could of course say other things, but this seems like the one we probably want most.) A concrete sketch of this mapping follows after the fourth point below.
On point (3), M contains exactly the information from X relevant to S, not the information that S contains (since it doesn’t have access to all the information S contains).
On point (2), it’s not that every aspect of S must be relevant to Z, but rather that every change in S must change our optimal strategy (when optimizing for Z). S could be relevant to Z in ways that don’t change our optimal strategy, and then we wouldn’t need to keep around all the info about S.
The idea that information comes in two steps, with the second input “choosing which game we play”, is important. Without that, it’s much less plausible that every change in S changes the optimal strategy. With information coming in two steps, we have to keep around all the information from the first step which could be relevant to any of the possible games; our “strategy” includes strategies for the sub-games resulting from each possible value of the second input, and a change in any one of those is enough.
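To make the neural-net correspondence concrete, here is a minimal sketch of how the pieces line up. The toy dataset, the logistic-regression model, and all variable names are my own illustration, not anything from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# X: the training data -- inputs paired with observed values of S.
n_train = 200
x_train = rng.normal(size=(n_train, 2))
true_w = np.array([1.5, -2.0])                 # the hidden relationship to S
s_train = (x_train @ true_w + rng.normal(scale=0.5, size=n_train) > 0).astype(int)

# M: what the regulator keeps around from X -- here just a fitted weight
# vector, far smaller than the raw training set but (ideally) containing
# everything in X that is relevant to predicting S.
def fit(x, s, steps=500, lr=0.1):
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))     # simple logistic regression
        w += lr * x.T @ (s - p) / len(s)
    return w

m = fit(x_train, s_train)

# Z: expected prediction accuracy. The "game" is only pinned down once the
# second input arrives -- new test points to be labeled -- so M has to be
# good enough for whichever test points show up.
x_test = rng.normal(size=(50, 2))
s_test = (x_test @ true_w > 0).astype(int)
predictions = (x_test @ m > 0).astype(int)
print("estimated Z (accuracy on new inputs):", (predictions == s_test).mean())
```

The weight vector m is tiny compared to the raw training set, which is the sense in which a good regulator keeps only the S-relevant information from X.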
> In the case of a neural net, I would probably say that the training data is X, and S is the thing we want to predict.
I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict.
> On point (3), M contains exactly the information from X relevant to S, not the information that S contains (since it doesn’t have access to all the information S contains).
If S is derived from X, then “information in S” = “information in X relevant to S”
> On point (2), it’s not that every aspect of S must be relevant to Z, but rather that every change in S must change our optimal strategy (when optimizing for Z). S could be relevant to Z in ways that don’t change our optimal strategy, and then we wouldn’t need to keep around all the info about S.
Fair point. I kind of wanted to abstract away this detail in the operationalization of “relevant”, but it does seem misleading as stated. Changed to “important for optimal performance”.
> The idea that information comes in two steps, with the second input “choosing which game we play”, is important.
I was hoping that this would come through via the neural net example, where Z obviously includes new information in the form of the new test inputs which have to be labeled. I’ve added the sentence “Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to look at S” to the second point to clarify.
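As a toy illustration of that added sentence (my own construction, nothing from the post): if the queries in Z were fixed and known in advance, M could simply hardcode their answers, whereas a fresh input arriving only at test time forces M to carry information about S.

```python
def secret_rule(x):              # stands in for the system S we want to predict
    return x % 7

# If Z only ever asked about these fixed, known-in-advance queries, M could be
# this tiny lookup table: it "solves Z" while keeping almost nothing about S.
fixed_queries = (3, 10, 24)
hardcoded_m = {q: secret_rule(q) for q in fixed_queries}

# If Z supplies fresh queries only at test time, M must carry enough about S
# to handle any of them -- idealized here as the full input-output rule.
learned_m = secret_rule

for q in (3, 512, 77):
    print(q, hardcoded_m.get(q, "no idea"), learned_m(q))
```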
(In general I struggled with keeping the summary short vs. staying true to the details of the causal model.)
> I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict.
That’s an (implicit) assumption in Conant & Ashby’s setup; I explicitly remove that constraint in the “Minimum Entropy → Maximum Expected Utility and Imperfect Knowledge” section. (That’s the “imperfect knowledge” part.)
> If S is derived from X, then “information in S” = “information in X relevant to S”
Same here. Once we relax the “S is a deterministic function of X” constraint, the “information in X relevant to S” is exactly the posterior distribution (s↦P[S=s|X]), which is why that distribution comes up so much in the later sections.
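A tiny worked example of that identification (the numbers are mine and purely illustrative): once S is no longer a deterministic function of X, two values of X carry the same S-relevant information exactly when they induce the same posterior over S.

```python
from fractions import Fraction as F

# A small joint distribution P[X = x, S = s] over X in {x1, x2, x3}, S in {0, 1}.
joint = {
    ("x1", 0): F(1, 8), ("x1", 1): F(1, 8),   # posterior for x1: (1/2, 1/2)
    ("x2", 0): F(2, 8), ("x2", 1): F(2, 8),   # posterior for x2: (1/2, 1/2) -- same as x1
    ("x3", 0): F(0, 8), ("x3", 1): F(2, 8),   # posterior for x3: (0, 1)
}

def posterior(x):
    """The map s -> P[S = s | X = x]."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return {s: joint[(x, s)] / px for s in (0, 1)}

for x in ("x1", "x2", "x3"):
    print(x, posterior(x))

# x1 and x2 induce the same posterior over S, so a regulator that lumps them
# together discards none of X's S-relevant information; distinguishing them
# would only add entropy to M without improving expected performance.
```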
> (In general I struggled with keeping the summary short vs. staying true to the details of the causal model.)
Yeah, the number of necessary nontrivial pieces is… just a little too high to not have to worry about inductive distance.
… That’ll teach me to skim through the math in posts I’m trying to summarize. I’ve edited the summary, lmk if it looks good now.
Good enough. I don’t love it, but I also don’t see easy ways to improve it without making it longer and more technical (which would mean it’s not strictly an improvement). Maybe at some point I’ll take the time to make a shorter and less math-dense writeup.