Planned summary (edited in response to feedback):

Consider a setting in which we must extract information from some data X to produce M, so that M can later be used to perform some task Z in a system S while only having access to M. We assume that the task depends only on S and not on X (except inasmuch as X affects S). As a concrete example, we might consider gradient descent extracting information from a training dataset (X) and encoding it in neural network weights (M), which can later be used to classify new test images (Z) taken in the world (S) without looking at the training dataset.
The key question: when is it reasonable to call M a model of S?

1. If we assume that this process is done optimally, then M must contain all information in X that is needed for optimal performance on Z.
2. If we assume that every aspect of S is important for optimal performance on Z, then M must contain all information about S that it is possible to get. Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to infer properties of S.
3. If we assume that M contains _no more_ information than it needs, then it must contain exactly the information about S that can be deduced from X.
It seems reasonable to say that in this case we constructed a model M of the system S from the source X “as well as possible”. This post formalizes this conceptual argument and presents it as a refined version of the [Good Regulator Theorem](http://pespmc1.vub.ac.be/books/Conant_Ashby.pdf).
Returning to the neural net example, this argument suggests that since neural networks are trained on data from the world, their weights will encode information about the world and can be thought of as a model of the world.
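Stated slightly more formally (this is a compressed paraphrase in the post's variables, not its exact theorem statement), the three assumptions together say that an optimal, minimal M should distinguish two possible datasets exactly when they lead to different beliefs about the system:

```latex
% Rough paraphrase, not the post's formal statement: an optimal, minimal M is
% informationally equivalent to the posterior beliefs about S given the data.
\[
M(X_1) = M(X_2)
\;\iff\;
\forall s:\; P[S = s \mid X_1] = P[S = s \mid X_2]
\]
```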
Four things I’d change:

- In the case of a neural net, I would probably say that the training data is X, and S is the thing we want to predict. Z measures (expected) accuracy of prediction, so to make good predictions with minimal info kept around from the data, we need a model. (Other applications of the theorem could of course say other things, but this seems like the one we probably want most.)
- On point (3), M contains exactly the information from X relevant to S, not the information that S contains (since it doesn’t have access to all the information S contains).
- On point (2), it’s not that every aspect of S must be relevant to Z, but rather that every change in S must change our optimal strategy (when optimizing for Z). S could be relevant to Z in ways that don’t change our optimal strategy, and then we wouldn’t need to keep around all the info about S.
- The idea that information comes in two steps, with the second input “choosing which game we play”, is important. Without that, it’s much less plausible that every change in S changes the optimal strategy. With information coming in two steps, we have to keep around all the information from the first step which could be relevant to any of the possible games; our “strategy” includes strategies for the sub-games resulting from each possible value of the second input, and a change in any one of those is enough.
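A toy numerical sketch of that last point (my own illustrative example, not one from the post): S is two hidden bits, X reveals them, and the second input picks which bit the game asks about. A memory that throws away either bit is strictly worse in expectation, so an optimal minimal M keeps everything in X that is relevant to S.

```python
import random

def expected_score(encode, decode, trials=20_000, seed=0):
    """Average payoff when the memory M = encode(X) is fixed before the game is known."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = (rng.randint(0, 1), rng.randint(0, 1))  # hidden system state S = (s1, s2)
        x = s                                        # X: here the data reveals S exactly
        m = encode(x)                                # M: built from X alone
        g = rng.randint(0, 1)                        # second input: which bit the game asks about
        total += decode(m, g) == s[g]                # payoff 1 for a correct answer
    return total / trials

# Keeping all of X's information about S is optimal (~1.0); keeping only the first
# bit is fine for the g = 0 game but loses half of the g = 1 games (~0.75).
print(expected_score(lambda x: x,    lambda m, g: m[g]))
print(expected_score(lambda x: x[0], lambda m, g: m if g == 0 else 0))
```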
> In the case of a neural net, I would probably say that the training data is X, and S is the thing we want to predict.
I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict.
> On point (3), M contains exactly the information from X relevant to S, not the information that S contains (since it doesn’t have access to all the information S contains).
If S is derived from X, then “information in S” = “information in X relevant to S”
> On point (2), it’s not that every aspect of S must be relevant to Z, but rather that every change in S must change our optimal strategy (when optimizing for Z). S could be relevant to Z in ways that don’t change our optimal strategy, and then we wouldn’t need to keep around all the info about S.
Fair point. I kind of wanted to abstract away this detail in the operationalization of “relevant”, but it does seem misleading as stated. Changed to “important for optimal performance”.
> The idea that information comes in two steps, with the second input “choosing which game we play”, is important.
I was hoping that this would come through via the neural net example, where Z obviously includes new information in the form of the new test inputs which have to be labeled. I’ve added the sentence “Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to look at S” to the second point to clarify.
(In general I struggled with keeping the summary short vs. staying true to the details of the causal model.)
> I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict.

That’s an (implicit) assumption in Conant & Ashby’s setup; I explicitly remove that constraint in the “Minimum Entropy → Maximum Expected Utility and Imperfect Knowledge” section. (That’s the “imperfect knowledge” part.)
> If S is derived from X, then “information in S” = “information in X relevant to S”
Same here. Once we relax the “S is a deterministic function of X” constraint, the “information in X relevant to S” is exactly the posterior distribution (s↦P[S=s|X]), which is why that distribution comes up so much in the later sections.
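To spell out why the posterior is exactly the right object to keep (standard decision theory, in my notation rather than the post's): once the later input z arrives, an optimal action depends on X only through that posterior,

```latex
% Standard decision theory: with u the payoff and z the later input, the optimal
% action depends on X only through the posterior s -> P[S = s | X], so that
% posterior is exactly what the memory M needs to preserve.
\[
a^{*}(z, X) \;\in\; \arg\max_{a} \sum_{s} P[S = s \mid X]\, u(a, s, z)
\]
```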
> (In general I struggled with keeping the summary short vs. staying true to the details of the causal model.)

Yeah, the number of necessary nontrivial pieces is… just a little too high to not have to worry about inductive distance.
… That’ll teach me to skim through the math in posts I’m trying to summarize. I’ve edited the summary, lmk if it looks good now.

Good enough. I don’t love it, but I also don’t see easy ways to improve it without making it longer and more technical (which would mean it’s not strictly an improvement). Maybe at some point I’ll take the time to make a shorter and less math-dense writeup.