Abstract: We prove a version of the Good Regulator Theorem for a regulator with imperfect knowledge of its environment aiming to minimize the entropy of an output.
The Original Good Regulator Theorem
The original Good Regulator Theorem (from the 1970 paper of Conant and Ashby) concerns a setup involving three random variables: a ‘System’ S, an ‘Output’ Z and a ‘Regulator’ R. They are related by the following Bayes net:
The regulator receives an input from the system S and then takes an action which is represented by the random variable R. This action interacts with the system variable S to produce the output Z. The function which maps the (R,S) pair to Z is assumed to be deterministic. The aim of the regulator is to minimize the entropy of the output Z. In order to do this, it must use a ‘policy’ which is characterized by a conditional probability distribution P(R|S). When discussing the regulator policy, we often anthropomorphize the regulator and talk about it ‘choosing’ actions, but this is just used as an intuitive way of discussing the conditional probability distribution P(R|S). For example, we might say ‘the regulator will always choose action rA when presented with system state sA’ which means that P(R=rA|S=sA)=1.
The theorem shows that a regulator which achieves the lowest output entropy and is not ‘unnecessarily complex’ must be a deterministic function of the System ie. the conditional probabilities P(R|S) must all be 0 or 1. If you want more details, I have written an explainer of the theorem.
The setup used in the original Good Regulator Theorem is restrictive in two important ways.
Firstly, it assumes that the regulator has perfect knowledge of the system state. In other words, it assumes that the regulator can assign an action (or probabilistic policy) for every possible system state si. This means that the theorem does not cover cases where the regulator does not have full information about the system state. For example, if the system state was a bit string of length ten, but the regulator could only ‘see’ the first five bits of the string, this setup would not be covered by the original Good Regulator Theorem. For the setup to be covered by the original theorem, the regulator would have to be able to ‘see’ the full ten bits of the system and choose an action (or probabilistic mixture of actions) for each unique system string.
Secondly, the setup assumes that the output Z is a deterministic function of R and S, leaving no room for randomness anywhere in the whole Bayes net, except for the initial probability distribution over values of S.
Both of these restrictions can be lifted by considering an alternative setup:
In this setup, the variable X represents the part of the environment that the regulator can observe, which may not be the full environment. This variable feeds into the system state S, but with some randomness, allowing for other, unobserved variables to affect the system state. Finally, the output Z is determined by S and R. Without loss of generality, we can assume that Z depends deterministically on S and R, since we could always extend the definition of S to include some random noise. If we wish to explicitly represent some ‘noise’ variable feeding into S, we could draw the diagram like this:
but we will keep the noise variable implicit by allowing for some randomness in the X→S dependency.
Wentworth’s ‘Little Less Silly’ Good Regulator Theorem
Using the ‘imperfect knowledge’ framework described by the diagram above (where the regulator has knowledge of X but not S), John Wentworth proved his ‘Little Less Silly’ version of the Good Regulator Theorem. Instead of considering entropy minimization as the target of optimization, he used expected utility maximization (where the utility function is defined over the outcomes Z). He showed that, in this setup, a regulator which maximized expected utility and contained no unnecessary randomness was a deterministic function of the conditional posterior distribution P(S|X=x)[1]. In other words, the regulator’s action would be a deterministic function of X, with the added condition that it would pick the same action for two X-values x1 and x2 if they lead to the same posterior distribution of system values (ie. P(S|X=x1)=P(S|X=x2)). The regulator would only pick different actions for two X-values if they lead to different posterior distributions. This allows you to say that a regulator which maximizes expected utility and is not unnecessarily complex is in some sense equivalent to a regulator which calculates (or ‘models’) the distribution P(S|X).
Note that this theorem is behavioural; it concerns the relationship between the inputs and outputs of the regulator. It is not structural; ie. it does not tell us anything about what the internal structure of the regulator must be, except that the internal structure must be compatible with the input-output behaviour. I mention this because the Good Regulator Theorem is sometimes discussed in the context of the Agent-like Structure Problem which asks ‘does agent-like behaviour imply agent-like structure?‘. While the theorem might help us to characterise agent-like behaviour, on its own it does not tell us anything about structure. This is true of the original Good Regulator Theorem, Wentworth’s ‘little less silly’ version, and the version presented in this post.
In his post, after proving this theorem, John Wentworth then makes the following note:
Important note: I am not sure whether this result holds for minimum entropy. It is a qualitatively different problem, and in some ways more interesting—it’s more like an embedded agency problem, since decisions for one input-value can influence the optimal choice given other X-values.
He then leaves this hanging before going on to prove the more complex result which people have have called the ‘Gooder Regulator’ theorem.
Here’s the issue he is pointing at, as I understand it. With utility maximization, one can choose a policy which maximizes expected utility in each ‘branch’ of possible outcomes and this policy will maximize the overall expected utility. However, for entropy minimization, this is not the case. You can ‘mix’ several low entropy distributions and get an overall distribution which is high entropy. This means that an entropy minimization policy must take into account its actions in other ‘branches’ in a way that utility maximization does not need to. Having said this, I don’t think that this is an insurmountable obstacle. After all, the original Good Regulator Theorem concerns entropy minimization and this setup is only a small adjustment to the one used in the original paper.
As far as I can tell, no-one has written down a Good Regulator Theorem for minimizing entropy with imperfect knowledge of the system state[2]. I think it’s do-able without running into any embedded-agency-like problems. Here’s my attempt.
Theorem Statement
Consider a regulator operating the in the ‘imperfect knowledge’ setup described above.
Call a regulator ‘good’ if it satisfies the two following criteria:
It achieves the minimum possible output entropy.
It contains no ‘unnecessary complexity’.
The term ‘unnecessary complexity’ is used by the authors of the original Good Regulator paper to mean unnecessary randomness. If a regulator randomizes between two actions which lead to the same result, then this is considered unnecessary randomness.
We will prove that a good regulator R will be a deterministic function of its input X. Furthermore, if two X-values x1 and x2 lead to the same posterior distribution over S, ie. P(S|X=x1)=P(S|X=x2), then a good regulator will choose the same action for those two inputs. In other words: a good regulator chooses a different R-value for two X-values only if the two X-values lead to different posterior distributions over S. We can say that such a regulator is a deterministic function of P(S|X=x). In this sense, we can say that a good regulator is equivalent (in terms of input-output behaviour) to one which is modelling the distribution over system states P(S|X).
Proof Sketch
This proof follows approximately the same structure as the original Good Regulator Theorem and is sketched out below.
Suppose we have a regulator which achieves the minimum possible entropy for some setup. We then prove the following: if, for a given X-value (call it x0) the regulator takes two different actions with non-zero probability (say P(R=ra|X=x0)>0 and P(R=rb|X=x0)>0), then both of those R-values must lead to the same posterior distribution over Z, conditional on x0. A corollary of this is that if both of the R-values do not lead to the same posterior distribution over Z, conditional on x0, then we can construct a regulator which achieves a lower entropy. Then, we invoke the requirement that a good regulator must contain no unnecessary complexity to show that a good regulator must be a deterministic function of its input X. Finally, we show why a good regulator must be a deterministic function not only of X, but of the posterior distribution P(S|X=x). If a regulator is a deterministic function of X but not of the posterior distribution, then it is either ‘unnecessarily complex’ or fails to achieve minimal entropy.
This is analogous to structure of the original Good Regulator Theorem. In the original paper, the main lemma states that if for a given S-value an entropy-minimizing regulator takes two different actions with non-zero probability then both actions (along with that S-value) lead to the same Z-value. The difference here is that Z is a probabilistic, not a deterministic function of X (the regulator input) and R (the regulator output).
The original Good Regulator authors make the assumption that the entropy-minimizing distribution of Z is unique to simplify the proof. As far as I can tell, this assumption is unnecessary and the proof presented below does not require it. The proof below shows something like: ‘If R is not a deterministic function of X, then it does not achieve minimal entropy. Furthermore if R is not also a deterministic function of P(S|X=x), then it is unnecessarily complex’. This statement is true even if there are multiple possible entropy-minimizing policies or Z-distributions
Proof
Suppose we have a regulator as described above which takes two different actions ra and rb each with non-zero probability when presented with the X-value x0. We will not concern ourselves with exactly what the regulator does when presented with a different X-value but we will assume it has a well-defined policy. We can write the overall probability distribution over outputs as:
P(Z)=P(Z|X=x0)P(X=x0)+P(Z|X≠x0)P(X≠x0).
Based on the Bayes Net of the setup, we can break down the P(Z|X=x0) to explicitly represent all variable dependencies:
P(Z|X=x0)=qP(Z|X=x0,R=ra)+(1−q)P(Z|X=x0,R=rb),
where q=P(R=ra|X=x0) and P(Z|X=x0,R=ra) is obtained by summing over S-values:
P(Z|X=x0,R=ra)=∑iP(Z|X=x0,R=ra,S=si)P(S=si|X=x0)
with a similar expression for P(Z|X=x0,R=rb).
To reduce the size of our equations let us write Pa(Z)=P(Z|X=x0,R=ra) and Pb(Z)=P(Z|X=x0,R=rb). We can then write the overall distribution as:
P(Z)=P(Z|X≠x0)P(X≠x0)+[qPa(Z)+(1−q)Pb(Z)]P(X=x0).
If we write P(X=x0) as Px and P(X≠x0)as (1−Px), we can now re-group this expression as follows:
This equation is cumbersome but has a fairly straightforward interpretation. The distribution over Z can be be viewed as a mixture of two probability distributions. With probability q, Z is drawn from a distribution which itself is a mixture of P(Z|X≠x0) and Pa(Z). With probability (1−q), Z is drawn from a distribution which is a mixture of Pb(Z) and P(Z|X≠x0). To simplify equations, let us write:
This allows us to write the overall distribution as a mixture in a much clearer way:
P(Z)=qQa(Z)+(1−q)Qb(Z).
Now, we invoke the concavity of entropy, which tells us that the overall entropy of P(Z) must be greater than or equal to the weighted sum of the entropies of Qa(Z) and Qb(Z). Recall q=P(R=ra|X=x0) is the probability that the regulator chooses action ra when presented with observation x0 and that is to find the regulator policy which minimizes the output entropy of Z. With this in mind, we can consider a few different possibilities:
The entropy of Qa(Z) is less than the entropy of Qb(Z). In this case, the output entropy is only minimized when q=1. In other words, this means that the entropy is minimized when the regulator policy picks action ra with probability 1 when presented with input x0.
The entropy of Qb(Z) is less than the entropy of Qa(Z). In this case, the output entropy is minimized when the regulator picks action rb with probability 1, when presented with x0.
The entropy of Qa(Z) is equal to the entropy of Qb(Z), but the distributions are different. In this case, the output entropy is minimized by a deterministic policy which picks ra with probability 1 when presented with x0or a policy which picks rb with probability 1 when presented with x0. Either policy will achieve the minimum possible entropy, but any probabilistic mixture of the two will have a higher entropy.
The two distributions Qa(Z) and Qb(Z) are the same. In this case, any choice of q will lead to the same output entropy.
By considering the four exhaustive cases above, we have therefore proved that that a regulator can only achieve minimum entropy by randomizing between actions if both of those actions lead to the same distribution over Z. Furthermore, we can enforce the condition that the regulator policy should contain no ‘unnecessary’ complexity or randomness. Then, even in the case where Qa(Z) and Qb(Z) are the same, the regulator would choose a deterministic policy.
Thus, we have proved that any regulator which achieves minimum entropy and contains no unnecessary randomness will be a deterministic function of its input X. This is analogous to the original good regulator theorem.
In what sense is this a ‘model’?
The authors of the original Good Regulator Theorem claim that it shows that a regulator must be ‘modelling’ its environment. In what sense is this true for the theorem we have just proved?
Firstly, there is the (slightly trivial) sense in which the regulator must be taking in information from its environment and using it to choose an action. We could think of this as the regulator modelling the part of the environment represented by the variable X. But this is not very satisfying, since the deterministic function which the ‘good’ regulator applies may be very trivial (for example, in some setups a ‘good’ regulator might just be a function which maps all X-values to the same regulator output).
Then, in closer analogy to John Wentworth’s work we could say that a good regulator is equivalent to a regulator which is a deterministic function of the posterior probability distribution P(S|X). In other words, a good regulator is equivalent to a one which, upon observing X=xi, calculates the distribution P(S|X=xi) and throws away information about X and makes its decision purely based on the distribution P(S|X=xi). We will unpack this idea a little bit more in the next section.
A Deterministic Function of the Posterior Distribution
We have already shown that a good regulator is a deterministic function of is input X. Furthermore, the posterior distributionP(S|X=xi) is also a deterministic function[3] of xi (though S itself is not necessarily a deterministic function of xi). Now, suppose that the regulator was a deterministic function of X, but not equivalent to a deterministic function of P(S|X=xi). The only way this could be the case would be if there are some xi-values which lead to the same posterior distributions (eg. P(S|X=x1)=P(S|X=x2) ) but the regulator chooses different R-values for each X-value. If choosing different R-values for each X-value with the same posterior S-distribution leads to different output Z-distribution, then this regulator cannot be optimal, due to a concavity of entropy argument, similar to the one in the above section. If choosing different R-values for each X-value with the same posterior S-distribution does not lead to different output Z-distributions, then we can argue that choosing different R-values constitutes ‘unnecessary complexity’ and require that a ‘good’ regulator chooses one R-value consistently for upon observing both X=x1 and X=x2. This regulator would therefore be equivalent to one which received xi as a input and calculated/modelled the distribution P(S|X=xi) and then made its decision purely based on this distribution, instead of the xi value it received.
In this post we will use the phrase ‘posterior distribution’ to refer to the conditional probability distribution P(S|X=x). This follows John Wentworth’s use of this phrase in his Good Regulator Theorem post. The word ‘posterior’ is usually used in the context of Bayesian updating and has the connotation that the regulator is ‘updating its belief’ about the system distribution, given the observation X=x. While it is certainty possible that the regulator has a ‘belief’ which it is ‘updating’, this is not necessary for the theorem to hold. You can safely think of the ‘posterior’ distribution P(S|X=x) as just a conditional distribution without missing anything important.
John Wentworth correctly points out that the proof of the original Good Regulator Theorem (with entropy minimization) holds for an imperfect knowledge setup like this, provided that S is a deterministic function of X (ie.X contains all information needed to perfectly reconstruct S). This is true, but does not cover the more interesting case, covered here, where S cannot be fully reconstructed from X alone. From ‘Fixing the Good Regulator Theorem’:
The whole proof actually works just fine with these two assumptions, and I think this is what Conant & Ashby originally intended. The end result is that the regulator output R must be a deterministic function of S, even if the regulator only takes X as input, not S itself (assuming S is a deterministic function of X, i.e. the regulator has enough information to perfectly reconstruct S).
There are a couple of functions with different type signatures discussed in this paragraph so I’ll briefly clarify them here. When I say that the distribution is a function of X, what I mean is that there is a function f(x)=P(S|X=x) whose domain contains all possible X-values and whose output is the distribution P(S|X=x). Later, when we talk about R being a deterministic function of the distribution P(S|X=x), this means that R=g(Q) where the domain of g includes the distributions P(S|X=x) for all x and the range of g is the set of possible R-values (or ‘actions’).
A Generalization of the Good Regulator Theorem
This post was written during the agent foundations fellowship with Alex Altair funded by the LTFF. Thanks to Alex for reading and commenting on the draft.
Abstract: We prove a version of the Good Regulator Theorem for a regulator with imperfect knowledge of its environment aiming to minimize the entropy of an output.
The Original Good Regulator Theorem
The original Good Regulator Theorem (from the 1970 paper of Conant and Ashby) concerns a setup involving three random variables: a ‘System’ S, an ‘Output’ Z and a ‘Regulator’ R. They are related by the following Bayes net:
The regulator receives an input from the system S and then takes an action which is represented by the random variable R. This action interacts with the system variable S to produce the output Z. The function which maps the (R,S) pair to Z is assumed to be deterministic. The aim of the regulator is to minimize the entropy of the output Z. In order to do this, it must use a ‘policy’ which is characterized by a conditional probability distribution P(R|S). When discussing the regulator policy, we often anthropomorphize the regulator and talk about it ‘choosing’ actions, but this is just used as an intuitive way of discussing the conditional probability distribution P(R|S). For example, we might say ‘the regulator will always choose action rA when presented with system state sA’ which means that P(R=rA|S=sA)=1.
The theorem shows that a regulator which achieves the lowest output entropy and is not ‘unnecessarily complex’ must be a deterministic function of the System ie. the conditional probabilities P(R|S) must all be 0 or 1. If you want more details, I have written an explainer of the theorem.
The setup used in the original Good Regulator Theorem is restrictive in two important ways.
Firstly, it assumes that the regulator has perfect knowledge of the system state. In other words, it assumes that the regulator can assign an action (or probabilistic policy) for every possible system state si. This means that the theorem does not cover cases where the regulator does not have full information about the system state. For example, if the system state was a bit string of length ten, but the regulator could only ‘see’ the first five bits of the string, this setup would not be covered by the original Good Regulator Theorem. For the setup to be covered by the original theorem, the regulator would have to be able to ‘see’ the full ten bits of the system and choose an action (or probabilistic mixture of actions) for each unique system string.
Secondly, the setup assumes that the output Z is a deterministic function of R and S, leaving no room for randomness anywhere in the whole Bayes net, except for the initial probability distribution over values of S.
Both of these restrictions can be lifted by considering an alternative setup:
In this setup, the variable X represents the part of the environment that the regulator can observe, which may not be the full environment. This variable feeds into the system state S, but with some randomness, allowing for other, unobserved variables to affect the system state. Finally, the output Z is determined by S and R. Without loss of generality, we can assume that Z depends deterministically on S and R, since we could always extend the definition of S to include some random noise. If we wish to explicitly represent some ‘noise’ variable feeding into S, we could draw the diagram like this:
but we will keep the noise variable implicit by allowing for some randomness in the X→S dependency.
Wentworth’s ‘Little Less Silly’ Good Regulator Theorem
Using the ‘imperfect knowledge’ framework described by the diagram above (where the regulator has knowledge of X but not S), John Wentworth proved his ‘Little Less Silly’ version of the Good Regulator Theorem. Instead of considering entropy minimization as the target of optimization, he used expected utility maximization (where the utility function is defined over the outcomes Z). He showed that, in this setup, a regulator which maximized expected utility and contained no unnecessary randomness was a deterministic function of the conditional posterior distribution P(S|X=x)[1]. In other words, the regulator’s action would be a deterministic function of X, with the added condition that it would pick the same action for two X-values x1 and x2 if they lead to the same posterior distribution of system values (ie. P(S|X=x1)=P(S|X=x2)). The regulator would only pick different actions for two X-values if they lead to different posterior distributions. This allows you to say that a regulator which maximizes expected utility and is not unnecessarily complex is in some sense equivalent to a regulator which calculates (or ‘models’) the distribution P(S|X).
Note that this theorem is behavioural; it concerns the relationship between the inputs and outputs of the regulator. It is not structural; ie. it does not tell us anything about what the internal structure of the regulator must be, except that the internal structure must be compatible with the input-output behaviour. I mention this because the Good Regulator Theorem is sometimes discussed in the context of the Agent-like Structure Problem which asks ‘does agent-like behaviour imply agent-like structure?‘. While the theorem might help us to characterise agent-like behaviour, on its own it does not tell us anything about structure. This is true of the original Good Regulator Theorem, Wentworth’s ‘little less silly’ version, and the version presented in this post.
In his post, after proving this theorem, John Wentworth then makes the following note:
He then leaves this hanging before going on to prove the more complex result which people have have called the ‘Gooder Regulator’ theorem.
Here’s the issue he is pointing at, as I understand it. With utility maximization, one can choose a policy which maximizes expected utility in each ‘branch’ of possible outcomes and this policy will maximize the overall expected utility. However, for entropy minimization, this is not the case. You can ‘mix’ several low entropy distributions and get an overall distribution which is high entropy. This means that an entropy minimization policy must take into account its actions in other ‘branches’ in a way that utility maximization does not need to. Having said this, I don’t think that this is an insurmountable obstacle. After all, the original Good Regulator Theorem concerns entropy minimization and this setup is only a small adjustment to the one used in the original paper.
As far as I can tell, no-one has written down a Good Regulator Theorem for minimizing entropy with imperfect knowledge of the system state[2]. I think it’s do-able without running into any embedded-agency-like problems. Here’s my attempt.
Theorem Statement
Consider a regulator operating the in the ‘imperfect knowledge’ setup described above.
Call a regulator ‘good’ if it satisfies the two following criteria:
It achieves the minimum possible output entropy.
It contains no ‘unnecessary complexity’.
The term ‘unnecessary complexity’ is used by the authors of the original Good Regulator paper to mean unnecessary randomness. If a regulator randomizes between two actions which lead to the same result, then this is considered unnecessary randomness.
We will prove that a good regulator R will be a deterministic function of its input X. Furthermore, if two X-values x1 and x2 lead to the same posterior distribution over S, ie. P(S|X=x1)=P(S|X=x2), then a good regulator will choose the same action for those two inputs. In other words: a good regulator chooses a different R-value for two X-values only if the two X-values lead to different posterior distributions over S. We can say that such a regulator is a deterministic function of P(S|X=x). In this sense, we can say that a good regulator is equivalent (in terms of input-output behaviour) to one which is modelling the distribution over system states P(S|X).
Proof Sketch
This proof follows approximately the same structure as the original Good Regulator Theorem and is sketched out below.
Suppose we have a regulator which achieves the minimum possible entropy for some setup. We then prove the following: if, for a given X-value (call it x0) the regulator takes two different actions with non-zero probability (say P(R=ra|X=x0)>0 and P(R=rb|X=x0)>0), then both of those R-values must lead to the same posterior distribution over Z, conditional on x0. A corollary of this is that if both of the R-values do not lead to the same posterior distribution over Z, conditional on x0, then we can construct a regulator which achieves a lower entropy. Then, we invoke the requirement that a good regulator must contain no unnecessary complexity to show that a good regulator must be a deterministic function of its input X. Finally, we show why a good regulator must be a deterministic function not only of X, but of the posterior distribution P(S|X=x). If a regulator is a deterministic function of X but not of the posterior distribution, then it is either ‘unnecessarily complex’ or fails to achieve minimal entropy.
This is analogous to structure of the original Good Regulator Theorem. In the original paper, the main lemma states that if for a given S-value an entropy-minimizing regulator takes two different actions with non-zero probability then both actions (along with that S-value) lead to the same Z-value. The difference here is that Z is a probabilistic, not a deterministic function of X (the regulator input) and R (the regulator output).
The original Good Regulator authors make the assumption that the entropy-minimizing distribution of Z is unique to simplify the proof. As far as I can tell, this assumption is unnecessary and the proof presented below does not require it. The proof below shows something like: ‘If R is not a deterministic function of X, then it does not achieve minimal entropy. Furthermore if R is not also a deterministic function of P(S|X=x), then it is unnecessarily complex’. This statement is true even if there are multiple possible entropy-minimizing policies or Z-distributions
Proof
Suppose we have a regulator as described above which takes two different actions ra and rb each with non-zero probability when presented with the X-value x0. We will not concern ourselves with exactly what the regulator does when presented with a different X-value but we will assume it has a well-defined policy. We can write the overall probability distribution over outputs as:
P(Z)=P(Z|X=x0)P(X=x0)+P(Z|X≠x0)P(X≠x0).Based on the Bayes Net of the setup, we can break down the P(Z|X=x0) to explicitly represent all variable dependencies:
P(Z|X=x0)=qP(Z|X=x0,R=ra)+(1−q)P(Z|X=x0,R=rb),where q=P(R=ra|X=x0) and P(Z|X=x0,R=ra) is obtained by summing over S-values:
P(Z|X=x0,R=ra)=∑iP(Z|X=x0,R=ra,S=si)P(S=si|X=x0)with a similar expression for P(Z|X=x0,R=rb).
To reduce the size of our equations let us write Pa(Z)=P(Z|X=x0,R=ra) and Pb(Z)=P(Z|X=x0,R=rb). We can then write the overall distribution as:
P(Z)=P(Z|X≠x0)P(X≠x0)+[qPa(Z)+(1−q)Pb(Z)]P(X=x0).If we write P(X=x0) as Px and P(X≠x0)as (1−Px), we can now re-group this expression as follows:
P(Z)=q[P(Z|X≠x0)(1−Px)+Pa(Z)Px]+(1−q)[P(Z|X≠x0)(1−Px)+Pb(Z)Px].This equation is cumbersome but has a fairly straightforward interpretation. The distribution over Z can be be viewed as a mixture of two probability distributions. With probability q, Z is drawn from a distribution which itself is a mixture of P(Z|X≠x0) and Pa(Z). With probability (1−q), Z is drawn from a distribution which is a mixture of Pb(Z) and P(Z|X≠x0). To simplify equations, let us write:
Qa(Z)=P(Z|X≠x0)(1−Px)+Pa(Z)Px,Qb(Z)=P(Z|X≠x0)(1−Px)+Pb(Z)Px.This allows us to write the overall distribution as a mixture in a much clearer way:
P(Z)=qQa(Z)+(1−q)Qb(Z).Now, we invoke the concavity of entropy, which tells us that the overall entropy of P(Z) must be greater than or equal to the weighted sum of the entropies of Qa(Z) and Qb(Z). Recall q=P(R=ra|X=x0) is the probability that the regulator chooses action ra when presented with observation x0 and that is to find the regulator policy which minimizes the output entropy of Z. With this in mind, we can consider a few different possibilities:
The entropy of Qa(Z) is less than the entropy of Qb(Z). In this case, the output entropy is only minimized when q=1. In other words, this means that the entropy is minimized when the regulator policy picks action ra with probability 1 when presented with input x0.
The entropy of Qb(Z) is less than the entropy of Qa(Z). In this case, the output entropy is minimized when the regulator picks action rb with probability 1, when presented with x0.
The entropy of Qa(Z) is equal to the entropy of Qb(Z), but the distributions are different. In this case, the output entropy is minimized by a deterministic policy which picks ra with probability 1 when presented with x0 or a policy which picks rb with probability 1 when presented with x0. Either policy will achieve the minimum possible entropy, but any probabilistic mixture of the two will have a higher entropy.
The two distributions Qa(Z) and Qb(Z) are the same. In this case, any choice of q will lead to the same output entropy.
By considering the four exhaustive cases above, we have therefore proved that that a regulator can only achieve minimum entropy by randomizing between actions if both of those actions lead to the same distribution over Z. Furthermore, we can enforce the condition that the regulator policy should contain no ‘unnecessary’ complexity or randomness. Then, even in the case where Qa(Z) and Qb(Z) are the same, the regulator would choose a deterministic policy.
Thus, we have proved that any regulator which achieves minimum entropy and contains no unnecessary randomness will be a deterministic function of its input X. This is analogous to the original good regulator theorem.
In what sense is this a ‘model’?
The authors of the original Good Regulator Theorem claim that it shows that a regulator must be ‘modelling’ its environment. In what sense is this true for the theorem we have just proved?
Firstly, there is the (slightly trivial) sense in which the regulator must be taking in information from its environment and using it to choose an action. We could think of this as the regulator modelling the part of the environment represented by the variable X. But this is not very satisfying, since the deterministic function which the ‘good’ regulator applies may be very trivial (for example, in some setups a ‘good’ regulator might just be a function which maps all X-values to the same regulator output).
Then, in closer analogy to John Wentworth’s work we could say that a good regulator is equivalent to a regulator which is a deterministic function of the posterior probability distribution P(S|X). In other words, a good regulator is equivalent to a one which, upon observing X=xi, calculates the distribution P(S|X=xi) and throws away information about X and makes its decision purely based on the distribution P(S|X=xi). We will unpack this idea a little bit more in the next section.
A Deterministic Function of the Posterior Distribution
We have already shown that a good regulator is a deterministic function of is input X. Furthermore, the posterior distribution P(S|X=xi) is also a deterministic function[3] of xi (though S itself is not necessarily a deterministic function of xi). Now, suppose that the regulator was a deterministic function of X, but not equivalent to a deterministic function of P(S|X=xi). The only way this could be the case would be if there are some xi-values which lead to the same posterior distributions (eg. P(S|X=x1)=P(S|X=x2) ) but the regulator chooses different R-values for each X-value. If choosing different R-values for each X-value with the same posterior S-distribution leads to different output Z-distribution, then this regulator cannot be optimal, due to a concavity of entropy argument, similar to the one in the above section. If choosing different R-values for each X-value with the same posterior S-distribution does not lead to different output Z-distributions, then we can argue that choosing different R-values constitutes ‘unnecessary complexity’ and require that a ‘good’ regulator chooses one R-value consistently for upon observing both X=x1 and X=x2. This regulator would therefore be equivalent to one which received xi as a input and calculated/modelled the distribution P(S|X=xi) and then made its decision purely based on this distribution, instead of the xi value it received.
In this post we will use the phrase ‘posterior distribution’ to refer to the conditional probability distribution P(S|X=x). This follows John Wentworth’s use of this phrase in his Good Regulator Theorem post. The word ‘posterior’ is usually used in the context of Bayesian updating and has the connotation that the regulator is ‘updating its belief’ about the system distribution, given the observation X=x. While it is certainty possible that the regulator has a ‘belief’ which it is ‘updating’, this is not necessary for the theorem to hold. You can safely think of the ‘posterior’ distribution P(S|X=x) as just a conditional distribution without missing anything important.
John Wentworth correctly points out that the proof of the original Good Regulator Theorem (with entropy minimization) holds for an imperfect knowledge setup like this, provided that S is a deterministic function of X (ie.X contains all information needed to perfectly reconstruct S). This is true, but does not cover the more interesting case, covered here, where S cannot be fully reconstructed from X alone. From ‘Fixing the Good Regulator Theorem’:
There are a couple of functions with different type signatures discussed in this paragraph so I’ll briefly clarify them here. When I say that the distribution is a function of X, what I mean is that there is a function f(x)=P(S|X=x) whose domain contains all possible X-values and whose output is the distribution P(S|X=x). Later, when we talk about R being a deterministic function of the distribution P(S|X=x), this means that R=g(Q) where the domain of g includes the distributions P(S|X=x) for all x and the range of g is the set of possible R-values (or ‘actions’).