Fixing The Good Regulator Theorem
Conant & Ashby’s “Every Good Regulator Of A System Must Be A Model Of That System” opens with:
The design of a complex regulator often includes the making of a model of the system to be regulated. The making of such a model has hitherto been regarded as optional, as merely one of many possible ways.
In this paper a theorem is presented which shows, under very broad conditions, that any regulator that is maximally both successful and simple must be isomorphic with the system being regulated. (The exact assumptions are given.) Making a model is thus necessary.
This may be the most misleading title and summary I have ever seen on a math paper. If by “making a model” one means the sort of thing people usually do when model-making—i.e. reconstruct a system’s variables/parameters/structure from some information about them—then Conant & Ashby’s claim is simply false.
What they actually prove is that every regulator which is optimal and contains no unnecessary noise is equivalent to a regulator which first reconstructs the variable-values of the system it’s controlling, then chooses its output as a function of those values (ignoring the original inputs). This does not mean that every such regulator actually reconstructs the variable-values internally. And Conant & Ashby’s proof has several shortcomings even for this more modest claim.
This post presents a modification of the Good Regulator Theorem, and provides a reasonably-general condition under which any optimal minimal regulator must actually construct a model of the controlled system internally. The key idea is conceptually similar to some of the pieces from Risks From Learned Optimization. Basically: an information bottleneck can force the use of a model, in much the same way that an information bottleneck can force the use of a mesa-optimizer. Along the way, we’ll also review the original Good Regulator Theorem and a few minor variants which fix some other problems with the original theorem.
The Original Good Regulator Theorem
We’re interested mainly in this causal diagram: the regulator sees an input X and produces an output R, the same input X also drives the system state S, and the outcome Z is then determined by S together with R.
The main goal is to choose the regulator policy P[R|X] to minimize the entropy of the outcome Z. Later sections will show that this is (roughly) equivalent to expected utility maximization.
After explaining this problem, Conant & Ashby replace it with a different problem, which is not equivalent, and they do not bother to point out that it is not equivalent. They just present roughly the diagram above, and then their actual math implicitly uses a different setup, in which the regulator sees the system state S directly rather than X.
Rather than choosing a regulator policy P[R|X], they instead choose a policy P[R|S]. In other words: they implicitly assume that the regulator has perfect information about the system state S (and their proof does require this). Later, we’ll talk about how the original theorem generalizes to situations where the regulator does not have perfect information. But for now, I’ll just outline the argument from the paper.
We’ll use two assumptions:
The entropy-minimizing distribution of Z is unique (i.e. if two different policies both achieve minimum entropy, they both produce the same Z-distribution). This assumption avoids a bunch of extra legwork which doesn’t really add any substance to the theorem.
Z is a deterministic function of (S, R). Note that we can always make this hold by including any nondeterministic inputs to Z in S itself (though that trick only works if we allow the regulator to have imperfect information about S, which violates Conant & Ashby’s setup… more on that later).
The main lemma then says: for any optimal regulator P[R|S], Z is a deterministic function of S. Equivalently: all R-values with nonzero probability (for a given S-value s) must give the same outcome Z.
Intuitive argument: if the regulator could pick two different Z-values (given S), then it can achieve strictly lower entropy by always picking whichever one has higher probability P[Z] (unconditional on S). Even if the two have the same P[Z], always picking one or the other gives strictly lower entropy (since the one we pick will end up with higher P[Z] once we pick it more often). If the regulator is optimal, then achieving strictly lower entropy is impossible, hence it must always pick the same Z-value given the same S-value. For that argument unpacked into a formal proof, see the paper.
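To make that concrete, here is a small numerical sketch (the numbers are made up purely for illustration): when a given S-value lets the regulator steer the outcome to either of two Z-values, always picking the one with higher unconditional probability P[Z] yields strictly lower entropy than randomizing.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Made-up example with three possible Z-values z0, z1, z2.
# Contributions to P[Z] from all the *other* S-values: 0.4, 0.2, 0.1 (total 0.7).
# One particular S-value occurs with probability 0.3, and given that S-value the
# regulator can steer the outcome to either z0 or z1.
mix_50_50 = [0.4 + 0.15, 0.2 + 0.15, 0.1]   # randomize between z0 and z1
always_z0 = [0.4 + 0.30, 0.2,        0.1]   # always pick the Z-value with higher P[Z]
always_z1 = [0.4,        0.2 + 0.30, 0.1]

print(entropy_bits(mix_50_50))   # ~1.34 bits
print(entropy_bits(always_z0))   # ~1.16 bits, strictly lower than mixing
print(entropy_bits(always_z1))   # ~1.36 bits
```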
With the lemma nailed down, the last step in Conant & Ashby’s argument is that any remaining nondeterminism in P[R|S] is “unnecessary complexity”. All R-values chosen with nonzero probability for a given S-value must yield the same Z anyway, so there’s no reason to have more than one of them. We might as well make R a deterministic function of S.
Thus: every “simplest” optimal regulator (in the sense that it contains no unnecessary noise) is a “model” of the system (in the sense that the regulator output R is a deterministic function of the system state S).
The Problems
There are two immediate problems with this theorem:
The notion of “model” is rather silly—e.g. the system could be quite complex, but the regulator could be an identity function, and it would count as a “model”
The regulator is assumed to have perfect knowledge of the system state S (i.e. the second setup above rather than the first)
Also, though I don’t consider it a “problem” so much as a choice which I think most people here will find more familiar:
The theorem uses entropy-minimization as its notion of optimality, rather than expected-utility-maximization
We’ll address all of these in the next few sections. Making the notion of “model” less silly will take place in two steps—the first step to make it a little less silly while keeping around most of the original’s meaning, the second step to make it a lot less silly while changing the meaning significantly.
Making The Notion Of “Model” A Little Less Silly
The notion of “model” basically says “R is a model of S iff R is a deterministic function of S”—the idea being that the regulator needs to reconstruct the value of S from its inputs in order to choose its outputs. But the proof-as-written-in-the-paper assumes that the regulator takes S as an input directly (i.e. the regulator chooses P[R|S]), so really the regulator doesn’t need to “model” S in any nontrivial sense in order for R to be a deterministic function of S. For instance, the regulator could just be the identity function: it takes in S and returns R = S. This does not sound like a “model”.
Fortunately, we can make the notion of “model” nontrivial quite easily:
Assume that S is a deterministic function of X
Assume that the regulator takes X as input, rather than S itself
The whole proof actually works just fine with these two assumptions, and I think this is what Conant & Ashby originally intended. The end result is that the regulator output R must be a deterministic function of S, even if the regulator only takes X as input, not S itself (assuming S is a deterministic function of X, i.e. the regulator has enough information to perfectly reconstruct S).
Note that this still does not mean that every optimal, not-unnecessarily-nondeterministic regulator actually reconstructs S internally. It only shows that any optimal, not-unnecessarily-nondeterministic regulator is equivalent to one which reconstructs S and then chooses its output as a deterministic function of S (ignoring X).
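As a sanity check, here is a tiny brute-force sketch of this version of the theorem; the setup (X uniform on {0,1,2,3}, S = X mod 2, Z = (S + R) mod 2) is invented for illustration. Every entropy-minimizing deterministic policy over X turns out to depend on X only through S, even though the regulator never sees S directly.

```python
import itertools
import numpy as np
from collections import Counter

# Invented toy setup: X uniform on {0,1,2,3}; S = X mod 2 is a deterministic
# function of X; the regulator reads X and outputs R in {0,1}; Z = (S + R) mod 2.
X_vals = [0, 1, 2, 3]
R_vals = [0, 1]
S = lambda x: x % 2
Z = lambda s, r: (s + r) % 2

def outcome_entropy(policy):
    """Entropy (bits) of Z under the deterministic policy R = policy[x]."""
    counts = Counter(Z(S(x), policy[x]) for x in X_vals)
    p = np.array(list(counts.values())) / len(X_vals)
    return float(-np.sum(p * np.log2(p)))

# Brute-force over all deterministic policies X -> R.
policies = list(itertools.product(R_vals, repeat=len(X_vals)))
best = min(outcome_entropy(pi) for pi in policies)
optimal = [pi for pi in policies if outcome_entropy(pi) == best]

# Every optimal policy's output depends on x only through S(x):
for pi in optimal:
    seen = {}
    for x in X_vals:
        assert seen.setdefault(S(x), pi[x]) == pi[x]

print(best, optimal)   # 0.0 entropy; each optimal policy is a function of S = x mod 2
```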
Minimum Entropy → Maximum Expected Utility And Imperfect Knowledge
I think the theorem is simpler and more intuitive in a maximum expected utility framework, besides being more familiar.
We choose a policy function P[R|X] to maximize expected utility. Since there’s no decision-theoretic funny business in this particular setup, we can maximize for each X-value independently:

$$P^*[R|X=x] \in \operatorname{argmax}_{P[R|X=x]} \sum_{r,s} u(s,r)\, P[S=s|X=x]\, P[R=r|X=x]$$

(where u(s, r) is the utility of the outcome determined by s and r).

Key thing to note: when two X-values yield the same posterior distribution P[S|X], the maximization problem above is exactly the same for those two X-values. So, we might as well choose the same optimal distribution P[R|X], even if there are multiple optimal options. Using different optima for different X-values, even when the maximization problems are the same, would be “unnecessary complexity” in exactly the same sense as in Conant & Ashby’s theorem.
So: every “simplest” (in the sense that it does not have any unnecessary variation in decision distribution) optimal (in the sense that it maximizes expected utility) regulator’s decision distribution is a deterministic function of the posterior distribution P[S|X] of the system state S. In other words, there is some equivalent regulator which first calculates the Bayesian posterior on S given X, then throws away X and computes its output just from that distribution.
This solves the “imperfect knowledge” issue for free. When the input data X is not sufficient to perfectly estimate the system state S, our regulator output is a function of the posterior distribution of S, rather than of S itself.
When the system state S can be perfectly estimated from the inputs X, the distribution P[S|X] is itself a deterministic function of S, therefore the regulator output will also be a deterministic function of S, recovering the original result.
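Here is a tiny worked sketch (the joint distribution and utility function are invented for illustration): two X-values with identical posteriors P[S|X] give identical maximization problems, so the optimal action can be computed from the posterior alone, whether or not X pins down S exactly.

```python
import numpy as np

# Invented joint distribution P[X, S]; rows are X-values, columns are S-values.
# X = 0 and X = 2 are deliberately given the same conditional distribution of S.
P_XS = np.array([
    [0.10, 0.30],   # X = 0
    [0.25, 0.05],   # X = 1
    [0.05, 0.15],   # X = 2  (same posterior over S as X = 0)
])
R_vals = [0, 1]
u = lambda s, r: 1.0 if r == s else 0.0   # utility: match the system state

def posterior(x):
    return P_XS[x] / P_XS[x].sum()        # P[S | X = x]

def optimal_action(x):
    post = posterior(x)
    return max(R_vals, key=lambda r: sum(post[s] * u(s, r) for s in range(2)))

for x in range(3):
    print(x, posterior(x), optimal_action(x))
# X=0 and X=2 share the posterior [0.25, 0.75] and therefore the same optimal
# action (r=1): the regulator's choice factors through P[S|X].
```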
Important note: I am not sure whether this result holds for minimum entropy. It is a qualitatively different problem, and in some ways more interesting—it’s more like an embedded agency problem, since decisions for one X-value can influence the optimal choice for other X-values.
Making The Notion Of “Model” A Lot Less Silly
Finally, the main event. So far, we’ve said that regulators which are “optimal” and “simple” in various senses are equivalent to regulators which “use a model”—i.e. they first estimate the system state, then make a decision based on that estimate, ignoring the original input. Now we’ll see a condition under which “optimal” and “simple” regulators are not just equivalent to regulators which use a model, but in fact must use a model themselves.
Here’s the new picture: our regulator now receives two “rounds” of data (X, then Y) before choosing the output R. In between, it chooses what information from X to keep around—the retained information is the “model” M. The interesting problem is to prove that, under certain conditions, M will have properties which make the name “model” actually make sense.
Conceptually, Y “chooses which game” the regulator will play. In order to achieve optimal play across all “possible games” Y might choose, M has to keep around any information relevant to any possible game. However, each game just takes S as input (not X directly), so at most M has to keep around all the information relevant to S. So: with a sufficiently rich “set of games” Y, we expect that M will have to contain all information from X relevant to S.
On the flip side, we want this to be an information bottleneck: we want M to contain as little information as possible (in an information-theoretic sense), while still achieving optimality. Combining this with the previous paragraph: we want M to contain as little information as possible, while still containing all information from X relevant to S. That’s exactly the condition for the Minimal Map Theorem: M must be (isomorphic to) the Bayesian distribution P[S|X].
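As a numerical illustration of that bottleneck condition (all numbers invented, and using the entropy H(M) as a rough stand-in for “amount of information kept”): mapping X to its posterior P[S|X] preserves all of the mutual information with S while storing strictly fewer bits than keeping X itself.

```python
import numpy as np
from collections import defaultdict

# Invented distribution: X uniform on {0,1,2,3}, with posteriors P[S|X] chosen so
# that X=0 and X=1 carry identical information about S.
P_S_given_X = {0: (0.8, 0.2), 1: (0.8, 0.2), 2: (0.5, 0.5), 3: (0.1, 0.9)}
P_X = {x: 0.25 for x in P_S_given_X}

def H(probs):
    p = np.array([v for v in probs if v > 0], dtype=float)
    return float(-np.sum(p * np.log2(p)))

def summarize(f):
    """Return (H(M), I(M;S)) for the summary M = f(X)."""
    P_M, P_MS, P_S = defaultdict(float), defaultdict(float), defaultdict(float)
    for x, px in P_X.items():
        m = f(x)
        P_M[m] += px
        for s, ps in enumerate(P_S_given_X[x]):
            P_MS[(m, s)] += px * ps
            P_S[s] += px * ps
    return H(P_M.values()), H(P_M.values()) + H(P_S.values()) - H(P_MS.values())

print(summarize(lambda x: x))               # identity:      H(M)=2.0 bits, I(M;S)~0.26 (all of it)
print(summarize(lambda x: P_S_given_X[x]))  # posterior map: H(M)=1.5 bits, I(M;S)~0.26 (still all of it)
print(summarize(lambda x: 0))               # constant:      H(M)=0.0 bits, I(M;S)=0 (loses info about S)
```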
That’s what we’re going to prove: if M is a minimum-information optimal summary of X, for a sufficiently rich “set of games”, then M is isomorphic to the Bayesian posterior distribution on S given X, i.e. M ≃ P[S|X]. That’s the sense in which M is a “model”.
As in the previous section, we can independently optimize for each X-value.
Conceptually, our regulator sees the X-value, then chooses a strategy, i.e. it chooses the distribution from which R will be drawn for each possible Y-value.
We’ll start with a simplifying assumption: there is a unique optimal regulator P[R|X,Y]. (Note that we’re assuming the full black-box optimal function of the regulator is unique; there can still be internally-different optimal regulators with the same optimal black-box function, e.g. using different maps from X to M.) This assumption is mainly to simplify the proof; the conclusion survives without it, but we would need to track sets of optimal strategies everywhere rather than just “the optimal strategy”, and the minimal-information assumption would ultimately substitute for uniqueness of the optimal regulator.
If two X-values yield the same Bayesian posterior P[S|X], then they must yield the same optimal strategy. Proof: the optimization problems are the same, and the optimum is unique, so the strategy is the same. (In the non-unique case, picking different strategies would force M to contain strictly more information, so the minimal-information optimal regulator will pick identical strategies whenever it can do so. Making this reasoning fully work when there are many optimal strategies takes a bit of effort and doesn’t produce much useful insight, but it works.)
The next step is more interesting: given a sufficiently rich set of games, not only is the strategy a function of the posterior, the posterior is also a function of the strategy. If two X-values yield the same optimal strategy, then they must yield the same Bayesian posterior P[S|X]. What do we mean by “sufficiently rich set of games”? Well, given two different posteriors P[S|X=x] and P[S|X=x'], there must be some particular Y-value for which the optimal strategy under P[S|X=x] is different from the optimal strategy under P[S|X=x']. The key is that we only need one Y-value for which the optimal strategies differ between the two posteriors.
So: by “sufficiently rich set of games”, we mean that for every pair of X-values with different Bayesian posteriors P[S|X], there exists some Y-value for which the optimal strategies differ. Conceptually: “sufficiently rich set of games” means that for each pair of two different possible posteriors, Y can pick at least one “game” (i.e. optimization problem) for which the optimal policy is different under the two posteriors.
From there, the proof is easy. The posterior is a function of the strategy, the strategy is a function of M, therefore the posterior is a function of M: two different posteriors P[S|X=x] ≠ P[S|X=x'] must have two different “models” M(x) ≠ M(x'). On the other hand, we already know that the optimal strategy is a function of the posterior, so in order for M to be information-minimal it must not distinguish between X-values with the same posterior P[S|X]. Thus: M(x) = M(x') if-and-only-if P[S|X=x] = P[S|X=x']. The “model” M is isomorphic to the Bayesian posterior P[S|X].
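Finally, here is a brute-force sketch of the whole argument on a toy example (all posteriors and games invented for illustration). The optimal strategy (one action per possible Y-value) depends on X only through the posterior P[S|X], and because the pair of games is sufficiently rich, distinct posteriors force distinct strategies; so a minimal-information M must distinguish X-values exactly when their posteriors differ.

```python
# Invented posteriors P[S|X] for four X-values; X=0 and X=1 deliberately share one.
posteriors = {
    0: (0.9, 0.1),
    1: (0.9, 0.1),
    2: (0.6, 0.4),
    3: (0.2, 0.8),
}
R_vals = [0, 1]
Y_vals = [0, 1]

def u(y, s, r):
    """Two 'games', selected by Y (both invented for illustration)."""
    if y == 0:                 # game 0: guess the system state
        return 1.0 if r == s else 0.0
    else:                      # game 1: pass (r=0, worth 0.3) or bet that S=1 (r=1)
        return 0.3 if r == 0 else (1.0 if s == 1 else 0.0)

def optimal_strategy(x):
    """Best response to the posterior P[S|X=x]: one action per possible Y-value."""
    post = posteriors[x]
    return tuple(
        max(R_vals, key=lambda r: sum(post[s] * u(y, s, r) for s in (0, 1)))
        for y in Y_vals
    )

strategies = {x: optimal_strategy(x) for x in posteriors}
print(strategies)   # {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (1, 1)}

# A minimal-information M only needs to distinguish X-values with different optimal
# strategies, and with this (sufficiently rich) pair of games those equivalence
# classes coincide exactly with the equivalence classes of X-values by posterior:
for a in posteriors:
    for b in posteriors:
        assert (strategies[a] == strategies[b]) == (posteriors[a] == posteriors[b])
```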
Takeaway
When should a regulator use a model internally? We have four key conditions:
The regulator needs to make optimal decisions (in an expected utility sense)
Information arrives in more than one timestep/chunk (X, then Y), and the earlier information X needs to be kept around until decision time
Keeping/passing information is costly: the amount of information stored/passed needs to be minimized (while still achieving optimal control)
Later information can “choose many different games”—specifically, whenever the posterior distribution of the system state S given two possible X-values is different, there must be at least one Y-value under which optimal play differs for the two X-values.
Conceptually, because we don’t know what game we’re going to play, we need to keep around all the information potentially relevant to any possible game. The minimum information which can be kept, while still keeping all the information potentially relevant to any possible game, is the Bayesian posterior on the system state S. There’s still a degree of freedom in how we encode the posterior on S (that’s the “isomorphism” part), but the “model” M definitely has to store exactly the posterior P[S|X].
I was impressed by this post. I don’t have the mathematical chops to evaluate it as math—probably it’s fairly trivial—but I think it’s rare for math to tell us something so interesting and important about the world, as this seems to do. See this comment where I summarize my takeaways; is it not quite amazing that these conclusions about artificial neural nets are provable (or provable-given-plausible-conditions) rather than just conjectures-which-seem-to-be-borne-out-by-ANN-behavior-so-far? (E.g. conclusions like “Neural nets trained on very complex open-ended real-world tasks/environments will build, remember, and use internal models of their environments… for something which resembles expected utility maximization!”) Anyhow, I guess I shouldn’t focus on the provability because even that’s not super important. What matters is that this seems to be a fairly rigorous argument for a conclusion which many people doubt, that is pretty relevant to this whole AGI safety thing.
It’s possible that I’m making mountains out of molehills here so I’d be interested to hear pushback. But as it stands I feel like the ideas in this post deserve to be turned into a paper and more widely publicized.
‘this comment where I summarize my takeaways’ appears to link to a high-lumen lightbulb on Amazon. I’d be interested in the actual comment! Is it this?
lol oops thank you!
Haha I was 99% sure, but I couldn’t tell if it was some elaborate troll or a joke I didn’t get (‘very bright idea’...?)