Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.
I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
Coherence theorem: Given an agent satisfying some axioms, we can observe its behavior in various conditions and construct a utility function U such that the agent’s behavior is equivalent to that of a system maximizing U.
Says nothing about whether the agent internally constructs U and uses it.
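(As a minimal sketch of the “we construct U” step, restricted to finitely many deterministic outcomes rather than full VNM lotteries; the setup and names here are my own toy, not from any particular coherence theorem:)

```python
def construct_utility(outcomes, chooses):
    """chooses(a, b) -> True iff the agent is observed to pick a over b.
    Assuming choices are complete and transitive, U(o) = number of
    alternatives o beats; maximizing this U reproduces every choice."""
    return {o: sum(chooses(o, other) for other in outcomes if other != o)
            for o in outcomes}

# Example: an agent observed to prefer apple > banana > cherry.
rank = {"apple": 2, "banana": 1, "cherry": 0}
U = construct_utility(rank, lambda a, b: rank[a] > rank[b])
assert U["apple"] > U["banana"] > U["cherry"]
```

Note that U lives entirely in the observer’s notebook: nothing in the construction requires the agent to store or compute anything like it, which is exactly the caveat above.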
(Little Less Silly version of the) Good regulator theorem: A regulator R that minimizes the entropy of a system variable S (where there is an environment variable X upstream of both R and S) without unnecessary noise (hence deterministic) is behaviorally equivalent to a deterministic function of S (despite being a function of X).
Says nothing about whether R actually internally reconstructs S and uses it to produce its output.
Causal good regulator theorem (summary): Given an agent achieving low regret across various environment perturbations, we can observe its behavior in specific perturbed environments and construct a G′ that is very similar to the true environment G. Then argue: “hence the agent must have something internally isomorphic to G”. Which is true, but …
says nothing about whether the agent actually uses those internal isomorphic-to-G structures in the causal history of computing its output.
And I got stuck here wondering, man, how do I ever prove anything structural.
Then I considered some theorems that, if you squint really really hard, could also be framed in the selection theorem language in a very broad sense:
SLT: Systems selected to get low loss are likely to be in a degenerate part of the loss landscape.[1]
Says something about structure: by assuming the system to be a parameterized statistical model, it says the parameters satisfy certain conditions like degeneracy (which further implies, e.g., modularity).
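(To make “degenerate” concrete with a two-parameter toy of my own, not an example from the SLT literature: the loss below is minimized on a whole union of lines rather than at an isolated point, and its Hessian at minima has zero eigenvalues, i.e., flat directions.)

```python
import numpy as np

def loss(w1, w2):
    # L = (w1*w2)^2 is minimized on the entire set {w1 = 0} U {w2 = 0}
    return (w1 * w2) ** 2

def hessian(w1, w2):
    # Analytic Hessian of (w1*w2)^2
    return np.array([[2 * w2**2,   4 * w1 * w2],
                     [4 * w1 * w2, 2 * w1**2]])

for w1, w2 in [(0.0, 0.0), (0.0, 3.0)]:
    print((w1, w2), loss(w1, w2), np.linalg.eigvalsh(hessian(w1, w2)))
# (0, 0): both eigenvalues are 0 (maximally degenerate point);
# (0, 3): one zero eigenvalue, the flat direction along the w2-axis.
```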
This made me realize that to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:
SLT represents a system as a parameterized function—very rich!
In the coherence theorem, the agent is just a single node that outputs a decision given lotteries. In the good regulator theorem and the causal good regulator theorem, the agent is literally just a single node in a Bayes Net—very impoverished!
And recall, we actually have an agent foundations style selection theorem that does prove something structural about agent internals by giving more mathematical structure to the agent:
Gooder regulator theorem: The regulator is now two nodes instead of one, but the latter-in-time node gets additional information about the choice of “game” it is being played against (thus the former node acts as a sort of information bottleneck). Then, given that the regulator makes S take minimum entropy, the first node must be isomorphic to the likelihood function s↦P(S=s|X).
This does say something about structure, namely that an agent (satisfying certain conditions) with an internal information bottleneck (structural assumption) must have that bottleneck be behaviorally equivalent to a likelihood function, whose output is then connected to the second node. Thus it is valid to claim that (under our structural assumption) the agent internally reconstructs the likelihood values and uses them in its computation of the output.
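(A finite toy of that bottleneck claim, with the theorem’s actual setup elided; the function below is purely my own illustration. The idea: the bottleneck may forget anything about X that doesn’t change the conditional distribution of S, and nothing more, so up to relabeling it computes x ↦ P(S=·|X=x).)

```python
from collections import defaultdict

def bottleneck_partition(p_s_given_x):
    """p_s_given_x[x] is a dict s -> P(S=s | X=x). Group values of X by
    their likelihood function over S: the bottleneck must distinguish
    two x's exactly when these conditional distributions differ."""
    classes = defaultdict(list)
    for x, dist in p_s_given_x.items():
        classes[tuple(sorted(dist.items()))].append(x)
    return list(classes.values())

# x=0 and x=1 induce the same distribution over S, so an optimal
# bottleneck may merge them; x=2 must remain distinguishable.
p = {0: {"s1": 0.5, "s2": 0.5},
     1: {"s1": 0.5, "s2": 0.5},
     2: {"s1": 1.0, "s2": 0.0}}
print(bottleneck_partition(p))  # [[0, 1], [2]]
```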
So in short: to have any hope of proving something structural, we need to put more initial structure or assumptions on our “agent” than literally a single node in a Bayes Net.
Here is my 5-minute attempt to give more such “structure” to the [agent/decision node] in the Causal good regulator theorem, in the hope that this would make the theorem more structural, and perhaps end up as a formalization of the Agent-like Structure Problem (for World Models, at least), or very similarly the Approximate Causal Mirror hypothesis:
Similar setup to the Causal good regulator theorem, but instead of a single node representing the agent’s decision node, assume that the agent as a whole is represented by an unknown causal graph G, with a number of nodes designated as input and output, connected to the rest-of-the-world causal graph E. Then claim: Agents with low regret must have a G that admits an abstracting causal model map (summary) from E, and (maybe more structural properties such as) the approximation error should roughly be lowest around the input/output & utility nodes, and increase as you move further away from them in the low-level graph. This would be a very structural claim!
I’m being very, very [imprecise/almost misleading] here—because I’m just trying to make a high-level point and the details don’t matter too much—one of the caveats (among many) being that this statement makes the as-yet theoretically unjustified connection between SGD and Bayes.
Yeah, I think structural selection theorems matter a lot, for reasons I discussed here.
This is also one reason why I continue to be excited about Algorithmic Information Theory. Computable functions are behavioral, but programs (= algorithms) are structural! The fact that programs can be expressed in the homogeneous language of finite binary strings gives a clear way to select for structure: just limit the length of your program. We even know exactly how this mathematical parameter translates into real-world systems, because we know exactly how many bits our ML models take up on the hard drive.
And I think you can use algorithmic information distance to well-define just how close to agent-structured your policy is. First, define the specific program A that you mean to be maximally agent-structured (which I define as a utility-maximizing program). If your policy (as a program) can be described as “Program A, but different in ways X”, then we have an upper bound on how close to agent-structured it is: X will be a program that tells you how to transform A into your policy, and that gives us a “distance” of at most the length of X in bits.
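(The true algorithmic distance is incomputable, but here is a crude computable stand-in for that upper bound, entirely my own illustration, with an off-the-shelf compressor playing the role of the shortest transformation program X:)

```python
import zlib

def bits(data: bytes) -> int:
    # Compressed length as a rough upper bound on description length
    return 8 * len(zlib.compress(data, 9))

def distance_upper_bound(agent_src: bytes, policy_src: bytes) -> int:
    """Bound the length of an 'edit program' X taking A to the policy:
    the extra cost of describing the policy with A as context."""
    return max(bits(agent_src + policy_src) - bits(agent_src), 0)

agent = b"argmax over actions of expected utility under the world model " * 20
tweaked = agent.replace(b"utility", b"utility-plus-noise")
unrelated = bytes(range(256)) * 5
print(distance_upper_bound(agent, tweaked))    # small: near-agent-structured
print(distance_upper_bound(agent, unrelated))  # much larger: far from A
```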
For a given length, almost no programs act anything like A. So if your policy is only slightly bigger than A, and it acts like A, then it’s probably of the form “A, but slightly different”, which means it’s agent-structured. (Unfortunately this argument needs like 200 pages of clarification.)
It’s maybe also worth saying that any other description method is a subset of programs (or is incomputable and therefore not what real-world AI systems are). So if the theoretical issues in AIT bother you, you can probably make a similar argument using a programming language with no while loop, or I dunno, finite MDPs whose probability distributions are Gaussian with finite parameter descriptions.
[Some thoughts that are similar to, but different from, my previous comment:]
I suspect you can often just prove the behavioral selection theorem and the structural selection theorem in separate, almost independent steps:
Prove a behavioral theorem,
add in a structural assumption,
then prove that the behavioral result plus the structural assumption implies the structural result (see the schematic just below).
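A minimal schematic of that composition (the notation here is mine, not the parent’s): write beh(m) for the behavior implemented by a structure m. Step 1 gives Selected(f) ⟹ B(f) for behaviors f; step 3 gives B(beh(m)) ∧ A(m) ⟹ S(m) for structures m; chaining them yields Selected(beh(m)) ∧ A(m) ⟹ S(m), with the behavioral property B acting as the interface between the two halves.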
Behavior essentially serves as an “interface”, and a given behavior can be implemented by any number of different structures. So it would make sense that you need to prove something about structure separately (and that you can prove it for multiple different types of structural assumption).
Further claims: for any given structural class,
there will be a natural simplicity measure
simpler instances will be exponentially rare.
A structural class is something like programs, or Markov chains, or structural causal models. The point of specifying structure is to in some way model how the system might actually be shaped in real life. So it seems to me that any of these will be specified with a finite string over a finite alphabet. This comes with the natural simplicity measure of the length of the specification string, and there are exponentially fewer short strings than long ones.[1]
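(The counting fact doing the work here is elementary; a minimal sketch for a binary alphabet:)

```python
# There are 2**n binary strings of length n, hence 2**(L+1) - 1 strings of
# length <= L, so short descriptions are an exponentially small fraction.
def fraction_at_most(L: int, n: int) -> float:
    return (2 ** (L + 1) - 1) / (2 ** (n + 1) - 1)

print(fraction_at_most(10, 20))  # ~0.001: strings <= 10 bits among <= 20 bits
print(fraction_at_most(10, 40))  # ~1e-9: exponentially rarer still
```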
So let’s say you want to prove that your thing X, which has behavior B, has specific structure S. Since structure S has a fixed description length, you almost automatically know that it’s exponentially less likely for X to be one of the infinitely many structures with description length longer than S. (Something similar holds for being within delta of S.) The remaining issue is whether there are any other secret structures, shorter than S (or of similar length), that X could be instead.
Technically, you could have a subset of strings that didn’t grow exponentially. For example, you could, for some reason, decide to specify your Markov chains using only strings of zeros. That would grow linearly rather than exponentially. But this is clearly a less natural specification method.
There is a straightforward compmech take also. If the goal of the agent is simply to predict well (let’s say the reward is directly tied to good prediction) for a sequential task, AND it performs optimally, then we know it must contain the Mixed State Presentation (MSP) of the epsilon machine (i.e., beliefs over causal states). Importantly, the MSP must be used if optimal prediction is achieved.
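(For concreteness, here is the MSP on a standard toy process, the Zero-One-Random (Z1R) process: emit 0, then 1, then a fair coin flip, and repeat. The labeled transition matrices and the Bayes update are standard compmech; the snippet itself is just my sketch.)

```python
import numpy as np

# Z1R causal states {S0, S1, SR}; T[x][i, j] = P(next state j, emit x | state i)
T = {
    0: np.array([[0.0, 1.0, 0.0],    # S0 emits 0 deterministically -> S1
                 [0.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0]]),  # SR emits 0 with prob 1/2 -> S0
    1: np.array([[0.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0],    # S1 emits 1 deterministically -> SR
                 [0.5, 0.0, 0.0]]),  # SR emits 1 with prob 1/2 -> S0
}

def msp_step(eta, x):
    """Bayes-update the mixed state (belief over causal states) on symbol x."""
    unnorm = eta @ T[x]
    return unnorm / unnorm.sum()

eta = np.ones(3) / 3           # start ignorant about the causal state
for x in [0, 1, 1, 0]:         # a valid Z1R sample path
    eta = msp_step(eta, x)
    print(x, eta)              # the belief synchronizes to a single state
```

An optimal predictor has to carry exactly these belief states forward in time, which is what makes this a structural rather than merely behavioral statement.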
There is a variant, I think, that has not been worked out yet, but which we talked about briefly with Fernando and Vanessa in Manchester recently, for transducers/MDPs.
Thoughts, @Jeremy Gillen?
Not much to add, I haven’t spent enough time thinking about structural selection theorems.
I’m a fan of making more assumptions. I’ve had a number of conversations with people who seem to make the mistake of not assuming enough. Sometimes leading them to incorrectly consider various things impossible. E.g. “How could an agent store a utility function over all possible worlds?” or “Rice’s theorem/halting problem/incompleteness/NP-hardness/no-free-lunch theorems means it’s impossible to do xyz”. The answer is always nah, it’s possible, we just need to take advantage of some structure in the problem.
Finding the right assumptions is really hard though, it’s easy to oversimplify the problem and end up with something useless.
Yes. I would even say that finding the right assumptions is the most important part of proving nontrivial selection theorems.
I think I get what you mean, though “making more assumptions” is perhaps not the best way to think about it. Logic is monotonic (classical logic, at least), meaning that a valid proof remains valid even when adding more assumptions. The “taking advantage of some structure” seems to be something different.