What Selection Theorems Do We Expect/Want?
This post contains some of my current best guesses at aspects of agent type signatures for which I expect there are useful Selection Theorems, as well as properties of selection optima which I expect are key to proving these type signatures.
This post assumes you have read the intro post on the Selection Theorem program. The intended audience is people who might work on the program, so these blurbs are intended to be link-heavy hooks and idea generators rather than self-contained explanations.
Aspects Of Agent Type Signature
These generally reflect my own current high-level thinking (though not all of the ideas are mine). For a different (and excellent, and better-written) take on similar questions from someone else, check out Mark and Evan’s Agents Over Cartesian World Models.
(A → B) → A
Scott Garrabrant’s post (A → B) → A provides the basic starting point for a Hofstadter-esque agent type signature. Breaking it down:
A represents the agent’s “actions”.
B represents the “outcomes” of those actions.
(A → B) is the agent’s model of what outcomes result from an action. It’s a function which takes in actions, and spits out outcomes (that’s the arrow notation).
(A → B) → A is the full agent: it takes in the model (A → B), and then outputs an action (of type A) as a function of the model.
In pure abstract functional terms, the overall type signature is (A → B) → A. Lots of other type information is missing from this—there’s nothing about the types of the actions or the outcomes, for instance. But it seems like the right basic setup at a high level.
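To make the signature concrete, here is a minimal Python sketch. The argmax-over-modeled-outcomes policy and the toy action/outcome types are my own illustrative choices, not anything from Scott's post; the point is just the shape of the types.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")  # actions
B = TypeVar("B")  # outcomes

# The agent has type (A -> B) -> A: it consumes a model mapping actions to
# outcomes, and returns an action.
Agent = Callable[[Callable[[A], B]], A]

def argmax_agent(actions: Iterable[A], score: Callable[[B], float]) -> Agent:
    """One example agent: pick the action whose modeled outcome scores highest."""
    def agent(model: Callable[[A], B]) -> A:
        return max(actions, key=lambda a: score(model(a)))
    return agent

# Toy usage: actions and outcomes are ints, the model squares the action,
# and the agent prefers outcomes close to 10.
agent = argmax_agent(actions=range(-5, 6), score=lambda b: -abs(b - 10))
print(agent(lambda a: a * a))  # -> -3 (first action whose square, 9, is closest to 10)
```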
(A → B) → A + Abstraction
Generally speaking, low-level physical reality seems to be made of something like a giant causal network with symmetry. It doesn’t have any built-in “clouds” or “strange loops”. So, how do we get (A → B) → A structures in the physical world?
I expect that some chunks of the world naturally abstract into models with an agent-like type signature. In other words, approximately-all of the information about that chunk which propagates “far away” in the low-level causal network of the universe is summarized by a high-level model with an (A → B) → A signature.
This is the sort of “embedding” of agents for which I expect to find selection theorems.
World Models and Natural Abstractions
Humans’ world models sure do seem to have things in them—pencils, cars, other humans, etc. So a type signature for human world models (and, I expect, probably world models more generally) should give us some idea of what “things” are and how they correspond to structures in the world.
A different angle: to the extent that we expect to see roughly-Bayesian models (which is itself debatable, but at least better founded than EU maximization), what are the variables in those models, and how do they correspond to structures in the world?
This is another place where I expect the Natural Abstraction Hypothesis is relevant. Eventually, I expect to find selection theorems saying that agents evolve to reason using natural abstractions under a fairly broad range of conditions.
Goal Inputs
I’m surprised by just how much of a blind spot goal-inputs seem to be for today’s economists, AI researchers, etc. The coherence theorems usually cited to justify expected utility maximization models imply a quite narrow range of inputs to those utility functions: utilities are only over the outcomes on which agents can bet. Yet practitioners use utility functions over entire (unobservable) world states, world state trajectories, MDP states, etc, often without any way for the agent to bet on all of the outcomes.
Not that I think this is wrong; the inputs to human values seem to be abstract latent variables in humans’ world models, which are not things for which we can operationalize direct bets even in principle. (Indeed, this is arguably the fundamental barrier to making betting markets really useful.) The goal type signatures implied by the coherence theorems are not a good fit for realistic agenty systems.
I do expect that there are better theorems to be found—for instance, Description Length Minimization = Utility Maximization offers exactly the sort of goal input type signature I expect for humans, and connects it to information theoretic compressibility in a way which could plausibly be used in a selection theorem.
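To gesture at the core identity (my own compressed gloss of that post, not its full argument): minimizing expected description length of the world under a model M is the same optimization problem as maximizing expected utility with utility equal to log probability under M, so the goal's inputs are whatever variables M is defined over.

```latex
% Gloss: description length minimization = utility maximization with
% u(X) = log_2 P_M(X), where X ranges over the variables of the model M.
\[
\arg\min_{\pi}\; \mathbb{E}_{X \sim \pi}\!\left[-\log_2 P_M(X)\right]
\;=\;
\arg\max_{\pi}\; \mathbb{E}_{X \sim \pi}\!\left[u(X)\right],
\qquad u(X) := \log_2 P_M(X),
\]
% where pi stands for whatever the agent controls (e.g. a policy shaping the
% distribution of X); the negation simply flips argmin into argmax.
```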
Combined with the Natural Abstraction hypothesis, I also expect there are selection theorems saying that evolved agents will tend to have goals whose inputs are (estimates/models of) natural abstractions from the environment.
Goal Outputs
The subagents argument says that the “output” of an agent’s goal should not be a utility, but rather a set of utilities, each one corresponding to a different subagent. This still seems basically correct to me—the more I’ve thought about it, the more I’m surprised by how well it matches human goals/values in practice (most recently, for instance, this, as well as a recent in-person discussion about how a lot of “moral insights” feel like pareto-gain trades between subagents which value different things).
The theorem still needs to be integrated with all the other pieces of an agent type signature, especially uncertainty. (I expect that different subagents can have different world models, as well as different utility functions.)
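As a toy illustration of what that "set of utilities" output type might look like in practice (the veto rule and the specific subagent utilities below are my own illustrative assumptions, not the content of the subagents argument): a proposed trade is executed only if it is a Pareto gain across subagents.

```python
from typing import Callable, Sequence

Outcome = dict  # an outcome is just a bundle of features here (toy choice)
Utility = Callable[[Outcome], float]

def pareto_accepts(subagents: Sequence[Utility], current: Outcome, proposal: Outcome) -> bool:
    """Accept a proposed change only if no subagent is made worse off
    and at least one is made strictly better off (a Pareto gain)."""
    gains = [u(proposal) - u(current) for u in subagents]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)

# Toy subagents valuing different things:
subagents = [
    lambda o: o["rest"],         # one subagent values rest
    lambda o: o["achievement"],  # another values achievement
]

current = {"rest": 3, "achievement": 2}
trade_a = {"rest": 4, "achievement": 2}  # pure gain for one subagent -> accepted
trade_b = {"rest": 5, "achievement": 1}  # helps one, hurts the other -> vetoed

print(pareto_accepts(subagents, current, trade_a))  # True
print(pareto_accepts(subagents, current, trade_b))  # False
```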
Internal Structure
The Selection Theorem problem I currently find most interesting is to say things about the internal structure of evolved agents. Does the internal structure have separate modules for world models, goals, search/optimization process, etc? More generally, do we expect some kind of modularity in evolved agenty systems? This is basically Scott Garrabrant’s question about whether agent-like behavior implies agent-like architecture, though for evolved agents specifically. I think, in the context of large-scale evolved agents in sufficiently-complex environments, the answer is “yes”. Most of the subsections below will discuss reasons for that belief and possible angles by which to prove it.
Aspects Of Selection
Most existing selection theorems focus on optimality—i.e. if the system isn’t behaviorally equivalent to a particular type signature, then it’s strictly suboptimal in some way. I do think there’s still some fruit to be picked there, but it’s not where I expect to find the most important selection theorems going forward.
Instead, I expect key results to come from characterizing broad and robust optima.
“Broad” means that the optimum is not too sensitive to parameter values—for instance, a “broad optimum” in natural selection would mean that most small mutations do not yield too large a fitness loss. Roughly speaking, optima have to be broad in order for natural selection to actually find them—no matter how high the fitness of a design, if everything “nearby” that design in mutation-space has low fitness, then natural selection has no way to get there. Same with gradient descent and other numerical optimization techniques used in ML/AI.
“Robust” means that the optimum is not too sensitive to the environment distribution—for instance, a “robust optimum” in natural selection would mean that the organism’s fitness will not plummet if there’s some small-but-permanent change in the environment. Real-world selection processes (whether natural selection or SGD or economic profitability) generally work with multiple finite samples from the environment, so we should expect sample noise to push toward more-robust optima.
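For concreteness, here is one toy way to operationalize the two properties for a numerically-parametrized system. The particular measurement scheme (Gaussian parameter perturbations, averaging over environment samples) is my own choice, not a standard definition.

```python
import numpy as np

def breadth(fitness, theta, n_samples=1000, sigma=0.1, rng=None):
    """Average fitness loss under small random parameter perturbations
    (smaller average loss = broader optimum)."""
    rng = rng or np.random.default_rng(0)
    base = fitness(theta)
    losses = [base - fitness(theta + sigma * rng.standard_normal(theta.shape))
              for _ in range(n_samples)]
    return float(np.mean(losses))

def robustness(fitness_in_env, theta, envs_train, envs_shifted):
    """Fitness drop when the environment distribution shifts
    (smaller drop = more robust optimum)."""
    f_train = np.mean([fitness_in_env(theta, e) for e in envs_train])
    f_shifted = np.mean([fitness_in_env(theta, e) for e in envs_shifted])
    return float(f_train - f_shifted)

# Toy usage: a quadratic fitness peak, maximized when theta matches the environment.
fitness_in_env = lambda th, env: -float(np.sum((th - env) ** 2))
theta = np.zeros(3)
print(breadth(lambda th: fitness_in_env(th, np.zeros(3)), theta))            # ~0.03
print(robustness(fitness_in_env, theta, [np.zeros(3)], [0.5 * np.ones(3)]))  # 0.75
```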
Why would these two properties be crucial? Here are a few different angles.
Better Generalization Through Search
Risks From Learned Optimization gives an informal argument that mesa-optimizers (i.e. evolved agents) become more probable as the complexity and variability of the environment increase—in other words, robust optima tend to involve agents.
Rough argument: figuring out the right strategy takes some “optimization power”, measured in bits as the number of times we need to cut the strategy-search-space in half in order to find an optimal (or sufficiently-near-optimal) strategy. Some of that optimization can be done at “training time” by selecting a well-tuned system; some of it can be done at “runtime” by the selected system itself. However, any choices made at training time have to work for the whole range of possible environments; choices made at runtime can condition on whatever data has been observed for this particular environment-instance. So, if optimal choices vary a lot in a way which can be predicted from data available at runtime, then it makes sense to defer the optimization to runtime. Thus, agency: the system takes in data “at runtime” and then performs some optimization to make its choices.
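Here is a toy numerical illustration of that trade-off; the environment, reward function, and both policies are my own illustrative choices. When the optimal action varies across environment instances and is predictable from runtime observations, a policy that searches at runtime beats any single action fixed at training time.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.linspace(-1.0, 1.0, 21)

def reward(action, env_param):
    # Reward peaks when the action matches this instance's hidden parameter.
    return -(action - env_param) ** 2

# Each "environment instance" has a hidden parameter; at runtime the system
# sees a noisy observation of it.
env_params = rng.uniform(-1.0, 1.0, size=5000)
observations = env_params + 0.1 * rng.standard_normal(5000)

# "Training-time" policy: one fixed action chosen to maximize average reward.
fixed_action = actions[np.argmax([np.mean(reward(a, env_params)) for a in actions])]
fixed_score = np.mean(reward(fixed_action, env_params))

# "Runtime-search" policy: for each instance, search over actions using a model
# conditioned on the observation (here, the model simply trusts the observation).
runtime_actions = np.array([actions[np.argmax(reward(actions, obs))] for obs in observations])
runtime_score = np.mean(reward(runtime_actions, env_params))

print(f"fixed action at training time: {fixed_score:.3f}")  # roughly -0.33
print(f"search at runtime:             {runtime_score:.3f}")  # roughly -0.01
```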
The Gooder Regulator Theorem offers a similar view: there, an agent-like architecture is forced by the need to defer a decision until crucial data is available later on. However, the Gooder Regulator theorem only captures part of the “better generalization through search” argument from Risks From Learned Optimization; Gooder Regulator uses a somewhat-artificial information bottleneck rather than explicitly including a selection process (i.e. outer optimizer). Fully formalizing the argument seems to me like another very promising research direction.
Evolution of Modularity
Biological organisms are highly modular, at multiple different scales. This can be quantified and verified statistically, e.g. by mapping out protein networks and algorithmically partitioning them into parts, then comparing the connectivity of the parts. It can also be seen more qualitatively in everyday biological work: proteins have subunits which retain their function when fused to other proteins, receptor circuits can be swapped out to make bacteria follow different chemical gradients, manipulating specific genes can turn a fly’s antennae into legs, organs perform specific functions, etc, etc.
How and why does natural selection produce modular systems? Simulations have found that modular variation in the goals (i.e. environment) results in the evolution of modular systems. In other words, modularity comes from robustness of the optimum to variation in the environment. (Assuming modularity of the environment, which would follow from the Natural Abstraction Hypothesis.)
This seems to me like a particularly promising theoretical research direction—we have empirical results, but (as far as I know) no general selection theorems for the phenomenon. If we can get theorems about modularity of selected systems, then that would open the door to theorems about agent-like internal structure, as well as internal structure more generally.
Modularity of Broad Peaks
I don’t have a good reference on this one already written up, but here are a few related hand-wavy arguments that broad optima should involve some sort of modularity.
If a system is “modular”, that means it has subsystems which have low-dimensional interfaces with the other subsystems. Now imagine that our selection process is optimizing the system via some very-high-dimensional parameters (e.g. a genome), and many of those parameters only influence a single subsystem (e.g. many genes only influence one module of an organism). If the dimension of the parameters of a subsystem is higher than the dimension of its interface with the other subsystems, then there should be many parameter changes which only affect the internals of the subsystem but keep high-level functionality intact. Thus, a broad optimum: there are many ways of changing the parameters without a large performance loss.
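One way to make the dimension-counting explicit (a loose statement of a standard regular-value argument, under my own simplifying assumption that the "interface" can be formalized as a smooth map from subsystem parameters to interface behavior):

```latex
% If the subsystem's parameters are theta in R^n and its interface behavior is
% g(theta) in R^k with k < n, then (generically, when Dg has full rank k at the
% optimum theta*) the set of parameter changes that leave the interface untouched,
%     { theta : g(theta) = g(theta*) },
% is locally a manifold of dimension n - k. So there are n - k independent
% directions along which the parameters can move with zero effect on the rest
% of the system: a broad optimum along at least those directions.
\[
\dim \{\, \theta \in \mathbb{R}^n : g(\theta) = g(\theta^*) \,\} \;=\; n - k
\quad \text{(locally, when } \operatorname{rank} Dg(\theta^*) = k \text{)}.
\]
```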
More questionable argument: if a system is modular, and most optimization parameters only affect one module, then most single-parameter changes (e.g. mutations in natural selection) will only affect one module. The rest of the system will still work fine, so as long as the high-level structure is robust to loss of one module, overall performance won’t be very much worse. On the other hand, if every parameter affects everything, then we can’t have this sort of architectural robustness.
Here’s one way that questionable argument turns out to be basically correct. (Warning: linear algebra and calculus incoming.) In smooth optimization problems (e.g. numerical optimization, like neural net training), we typically quantify the width of a peak via the determinant of the second derivative matrix of the objective with respect to the parameters—i.e. $\left|\frac{\partial^2 f}{\partial \theta^2}\right|$. In two dimensions, for instance, we get

$$\left|\frac{\partial^2 f}{\partial \theta^2}\right| = \begin{vmatrix} \frac{\partial^2 f}{\partial \theta_1^2} & \frac{\partial^2 f}{\partial \theta_1 \partial \theta_2} \\[4pt] \frac{\partial^2 f}{\partial \theta_1 \partial \theta_2} & \frac{\partial^2 f}{\partial \theta_2^2} \end{vmatrix} = \frac{\partial^2 f}{\partial \theta_1^2}\,\frac{\partial^2 f}{\partial \theta_2^2} - \left(\frac{\partial^2 f}{\partial \theta_1 \partial \theta_2}\right)^2$$

Key thing to notice: this is strictly decreasing with respect to the magnitude of the “second-order interaction term” $\frac{\partial^2 f}{\partial \theta_1 \partial \theta_2}$. All else equal, the peak is broadest when that interaction term is small. This turns out to generalize to higher dimensions: when the interaction terms are small, the peak is broad. With a bit more linear algebra, we can also extend this to hierarchical structure—roughly speaking, a peak will be relatively broad when the (sometimes implicit) interaction terms are relatively small between modules. This isn’t a particularly accessible/intuitive result, but the overall takeaway is similar to the other two arguments: modularity yields broad peaks.
Like the “evolution of modularity” idea, this offers a potential path to theorems on the internal structure of evolved systems, and (hopefully) the internal architecture of evolved agents.