Epistemic Status
I am an aspiring selection theorist and I have thoughts.
Why Selection Theorems?
Learning about selection theorems was very exciting. It’s one of those concepts that felt so obviously right: a missing component in my alignment ontology that just clicked and made everything stronger.
Selection Theorems as a Compelling Agent Foundations Paradigm
There are many reasons to be sympathetic to agent foundations style safety research, as it most directly engages the hard problems/core confusions of alignment/safety. However, one concern with agent foundations research is that we might build sky-high abstraction ladders that grow increasingly disconnected from reality: abstractions that don’t quite describe the AI systems we deal with in practice.
I think Wentworth successfully sidestepped this problem in the post. He presented an intuitive story for why the Selection Theorems paradigm would be fruitful: it’s general enough to describe many paradigms of AI system development, yet concrete enough to say nontrivial/interesting things about the properties of AI systems (including properties that bear on their safety). Wentworth presented a few examples of extant selection theorems (most notably the coherence theorems) and later argued that selection theorems have a lot of research “surface area”, so that new researchers could be onboarded (relatively) quickly. He also outlined concrete steps people interested in selection theorems could take to contribute to the program.
Overall, I found this presentation of the case for selection theorems research convincing. I think that selection theorems provide a solid framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification. Furthermore, selection theorems seem to be very robust to paradigm shifts in the development of artificial intelligence. That is, regardless of what changes in architecture or training methodology subsequent paradigms may bring, I expect selection theoretic results to still apply[1].
I currently consider selection theorems to be the most promising agent foundations flavoured research paradigm.
Digression: Asymptotic Analysis and Complexity Theory
My preferred analogy for selection theorems is asymptotic complexity in computer science. Using asymptotic analysis, we can make highly non-trivial statements about the performance of particular (or arbitrary!) algorithms while abstracting away the underlying architecture, hardware, and other implementation details. As long as the implementation of the algorithm is amenable to our (very general) models of computation, the asymptotic/complexity theoretic guarantee will generally still apply.
For example, we have a very robust proof that no comparison-based sorting algorithm can attain better worst case time complexity than O(n log n) (this happens to be a very tight lower bound, as extant algorithms [e.g. mergesort] attain it). The model behind the lower bound for comparison sorting is very minimal and general:
Data operations
Comparing two elements
Moving elements (copying or swapping)
Cost: number of such operations
Any algorithm that performs sorting by directly comparing elements to determine their order conforms to this model. The lower bound of O(n log n) obtains because we can model the execution of the sorting algorithm by a binary decision tree:
Nodes: individual comparisons between elements
Edges: the possible outcomes of a comparison (≤ and >)
Leaf nodes: the unique permutation of the input array that corresponds to that particular root-to-leaf path of the tree
The number of comparisons the sorting algorithm performs on a given input permutation is the number of edges on the path from the root to the corresponding leaf, so the worst case running time of the algorithm is given by the height of the tree. Because there are n! possible permutations of the input array, the tree must have at least n! leaves, and the lowest attainable worst case complexity is lg(n!) = n lg(n) − n lg(e) + O(log n), which is in O(n log n).
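To make the bound concrete, here is a small sketch (assuming nothing beyond the Python standard library; the helper count_comparisons is my own name) that counts the element comparisons the built-in sort actually performs and checks them against ⌈lg(n!)⌉ for tiny inputs:

```python
import math
from functools import cmp_to_key
from itertools import permutations

def count_comparisons(seq):
    """Sort seq with the standard library sort while counting every
    element comparison it makes (the only cost in the comparison model)."""
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    sorted(seq, key=cmp_to_key(cmp))
    return count

for n in range(2, 9):
    lower_bound = math.ceil(math.log2(math.factorial(n)))  # ceil(lg(n!))
    worst_case = max(count_comparisons(p) for p in permutations(range(n)))
    print(f"n={n}: lower bound {lower_bound} comparisons, observed worst case {worst_case}")
```

For these tiny inputs the observed worst case should sit at or just above the lower bound; no comparison sort can dip below it.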
I reiterate that this is a very powerful result. Here we’ve set up very minimal assumptions about our model (comparisons are made between pairs of elements to determine order, the algorithm can copy or swap elements) and we’ve obtained a ridiculously strong impossibility result[2].
Selection Theorems as a Complexity Theoretic Analogue
Selection theorems present a minimal model of an intelligent system as an agent situated in an environment. The agents are assumed to be the product of some optimisation process selecting for performance on a given metric (e.g. inexploitability in multi-agent environments, for the coherence theorems).
The exact optimisation process performing the selection is abstracted away (only the performance metric/objective function(s) of the optimisation matters), and the hope is to do the same for the environment; that is, selection theoretic results should apply to a broad class of environments (e.g. for the coherence theorems, the only constraint imposed on the environment is that it contains other agents).
Using the above model, selection theorems try to derive[3] agent “type signatures” (the representation [data structures], interfaces [inputs & outputs] and embedding [in an underlying physical (or other low level) system] of the agent, and of specific agent aspects [world models, goals, etc.]). It’s through these type signatures that safety relevant properties of agents can be concretely formulated (and hopefully proven).
For example, the proposed anti-naturalness of corrigibility to expected utility maximisation can be seen as an “impossibility result”[4] for a safety property (corrigibility), derived from a selection theorem (the coherence theorems).
While this is a negative result, I expect no fundamental difficulty in obtaining positive selection theoretic guarantees of safety properties.
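To make “type signature” slightly more concrete, here is a toy sketch of the kind of interface such a signature might pin down. It is purely illustrative: the names (AgentSignature, update, policy, run) and the thermostat example are mine, not anything proposed in Wentworth’s post.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Obs = TypeVar("Obs")      # input interface: what the agent observes
Act = TypeVar("Act")      # output interface: what the agent does
State = TypeVar("State")  # representation: internal state / world model

@dataclass
class AgentSignature(Generic[Obs, Act, State]):
    """A toy rendering of an agent 'type signature': a representation plus
    the maps connecting it to the agent's interfaces."""
    init_state: State
    update: Callable[[State, Obs], State]  # world-model update
    policy: Callable[[State], Act]         # choice of action given the state

def run(agent: AgentSignature, observations):
    """Embed the agent in a stream of observations from an environment."""
    state = agent.init_state
    for obs in observations:
        state = agent.update(state, obs)
        yield agent.policy(state)

# A trivial inhabitant of the type: a thermostat-like agent.
thermostat = AgentSignature(
    init_state=20.0,
    update=lambda s, obs: 0.5 * s + 0.5 * obs,       # crude smoothed-temperature model
    policy=lambda s: "heat" if s < 21.0 else "off",
)
print(list(run(thermostat, [18.0, 19.5, 25.0, 26.0])))  # ['heat', 'heat', 'off', 'off']
```

On this framing, a selection theorem would tell us which signatures of roughly this shape get selected for, rather than us stipulating one up front.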
I see the promise of selection theorems as doing for AI safety what complexity theory does for algorithm performance.
The Power of Selection Theorems
I expect that we will be able to provide selection theoretic guarantees of nontrivial safety properties/desiderata.
In particular, I think selection theorems naturally lend themselves to proving properties that are selected for/emerge in the limit[5] of optimisation for particular objectives (convergence theorems?). I find the potential of asymptotic guarantees exhilarating.
Properties proven to emerge in the limit become more robust with (increasing) scale. I think that’s an incredibly powerful result. Furthermore, asymptotic complexity analysis suggests that it’s often easier to make statements about what holds in the limit than about what holds at particular intermediate levels. (We can very easily talk about how the performance of two algorithms compares on a particular problem as input size tends towards infinity, without considering implementation details or underlying hardware and ignoring all constant factors. To talk about the performance of two algorithms on inputs of a particular fixed size, we’d need to consider all the aforementioned details.)
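As a rough illustration of that asymmetry (a sketch only; the exact numbers depend on your machine and interpreter, which is precisely the point), compare a Θ(n²) and a Θ(n log n) sort at a few fixed sizes:

```python
import random
import timeit

def insertion_sort(a):  # Theta(n^2) comparisons, but very small constant factors
    a = list(a)
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def merge_sort(a):      # Theta(n log n) comparisons, larger constant factors
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

for n in [16, 128, 2048]:
    data = [random.random() for _ in range(n)]
    t_ins = timeit.timeit(lambda: insertion_sort(data), number=3)
    t_mrg = timeit.timeit(lambda: merge_sort(data), number=3)
    print(f"n={n}: insertion sort {t_ins:.4f}s, mergesort {t_mrg:.4f}s")
```

Which implementation wins at n = 16 or n = 128 comes down to constant factors and implementation details; the asymptotic claim only settles the comparison as n grows.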
The combination of:
“Properties that are selected for in the limit become more robust with (increasing) scale” and
“It is much easier to describe the limit of a process than particular intermediate states”
is immensely powerful[6]. It makes selection theorems a hugely compelling AI safety research paradigm, perhaps the one I find most personally compelling.
Reservations
While I am quite enamoured with Wentworth’s selection theorems, I find myself somewhat dissatisfied: as Wentworth framed them, I think they are a bit off.
A major limitation of the coherence theorems is that they constrain agents to an archetype that does not necessarily describe real agents (or other intelligent systems) well. In particular, the coherence theorems assume agent preferences are:
Static (do not change with time)
Path independent (the exact course of action taken to get somewhere does not affect the agent’s preferences; alternatively, it assumes that agents do not have internal states that factor into their preferences)
Complete (for any two options, the agent prefers one of them or is indifferent. It doesn’t permit a notion of “incomplete preferences”)
These assumptions turn out not to be very realistic: they don’t describe real world agents (e.g. humans) or some (relatively) inexploitable systems (e.g. financial markets) well.
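As a tiny toy of what the completeness precondition demands (the preference data below is invented purely for illustration):

```python
from itertools import combinations

# A toy preference relation over three options, with one unranked pair.
prefs = {
    ("tea", "coffee"): "tea",
    ("tea", "water"): "tea",
    ("coffee", "water"): None,  # the agent simply has no ranking here
}

def comparison(a, b):
    """Return the preferred option, or None if the pair is unranked."""
    return prefs.get((a, b)) or prefs.get((b, a))

def is_complete(options):
    """The coherence-theorem precondition: every pair must be ranked (or judged
    indifferent); an unranked pair means the preferences are incomplete."""
    return all(comparison(a, b) is not None for a, b in combinations(options, 2))

print(is_complete(["tea", "coffee", "water"]))  # False: these preferences are incomplete
```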
The failure of coherence theorems to carve reality at the joints is a valuable lesson re: choosing the right preconditions for our theorems (if our preconditions are too restrictive/strong, they might describe systems that don’t matter in the real world [“spherical cows”]). And it’s a mistake I worry that the paradigm of “agent type signatures” might be making.
To be precise, I am quite unconvinced that “agent” is the “true name” of the relevant intelligent systems. There are powerful artifacts (e.g. the base versions of large language models) that do not match the agent archetype as traditionally conceived. I do not know that the artifacts that ultimately matter would necessarily conform to the agent archetype[7]. Theorems that are exclusively about the properties of agents may end up not being very applicable to important systems of interest (if e.g. the first AGIs are created by a [mostly] self-supervised training process).
Agent selection theorems are IMO ultimately too restrictive (their preconditions are too strong to describe all intelligent systems of interest/they implicitly preclude from analysis some intelligent systems we’ll be interested in), and the selection theorem agenda should be generalised to optimisation processes and the kind of constructs they select for.
That is, regardless of paradigm, intelligent systems (e.g. humans, trained ML models and expert systems) are the products of optimisation processes (e.g. natural selection, stochastic gradient descent, and human design[8] respectively).
So, a theory based solely on optimisation processes seems general enough to describe all intelligent systems of interest (while being targeted enough to say nontrivial/interesting things about such systems) and minimal (we can’t relax the preconditions anymore while still obtaining nontrivial results about intelligent systems).
The agent type signature paradigm is insufficiently general.
In the remainder of this post, I would like to slightly adjust the concept of selection theorems to better reflect what I think they should be[9].
Types of Selection Theorems
There are two broad classes of theorems that seem valuable:
Constructor Theorems
For a given (collection of) objective(s) and underlying configuration space, what type[10] of artifacts are produced by constructive optimisation processes (e.g. natural selection, stochastic gradient descent and human design) that select for performance on said objective(s)?
Fundamentally, they ask the question:
What properties are selected for by optimisation for a particular (collection of) objective function(s)?
The aforementioned “convergence theorems” would be a particular kind of constructor theorem.
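A cartoon of the question (a deliberately toy sketch; optimise is a made-up stand-in for a constructive optimisation process, not a claim about how any real one works): run several processes that differ in their details but select on the same objective, and look at what property the resulting artifacts share.

```python
import random

def optimise(objective, n_bits=20, steps=2000, mutation_rate=0.05, seed=0):
    """A toy constructive optimisation process: random-mutation hill climbing.
    The particulars (mutation rate, seed, starting point) are exactly the
    details a constructor theorem would hope to abstract away."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    for _ in range(steps):
        candidate = [b ^ (rng.random() < mutation_rate) for b in x]
        if objective(candidate) >= objective(x):
            x = candidate
    return x

count_ones = lambda bits: sum(bits)  # the objective being selected for

for seed, rate in [(0, 0.02), (1, 0.05), (2, 0.10)]:
    artifact = optimise(count_ones, mutation_rate=rate, seed=seed)
    print(f"seed={seed}, mutation_rate={rate}: ones={sum(artifact)}/20")
```

Despite differing in every detail except the objective, the processes end up selecting artifacts with (essentially) the same property, all or nearly all bits set; convergence theorems would aim to make that kind of statement precise, in the limit of optimisation.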
Artifact Theorems
Artifact theorems are the dual of constructor theorems. If constructor theorems seek to identify the artifact type produced by a particular constructive optimisation process, then artifact theorems seek to identify the constructive optimisation process that produced a particular artifact (e.g. the human brain, trained ML models, or the quicksort algorithm).
That is:
For a given artifact type and associated configuration spaces, what were the objectives[11] of the optimisation process that produced it?
I.e. describe the class of problems/domains/tasks the objectives belong to
Can we also specify a type for the objectives?
What properties do its members have?
Which properties are necessary to select for that artifact type?
Which properties are sufficient?
What is its parent type?
What are the interesting child types?
I suspect that e.g. investigating general intelligence artifact theorems would be a promising research agenda for robust safety of arbitrarily capable general systems.
[1] Provided we use sufficiently general agent/system models as the foundation for our selection theoretic results.
[2] I should point out that this impossibility result is somewhat atypical; for many interesting problems we don’t regularly obtain (non-trivial [e.g. the size of the input or output]) tight lower bounds on complexity.
[3] Usually, some parts of the type signatures are assumed (implicitly or explicitly) by the theorem.
[4] Jessica Taylor told me that she thinks the anti-naturalness of corrigibility is more of a “research intuition” than a theorem.
[5] I’m under the impression that it was when thinking about what emerges in the limit that I first drew the relationship between selection theorems and complexity theory. However, this may be a false memory (or otherwise not a particularly reliable recollection of events).
[6] It feels almost too good to be true, like we’re cheating in the mileage we get out of selection theorems.
[7] While any physical system can be constituted as an agent situated in an environment, the agent archetype is not illuminating for all of them. Viewing a calculator as an agent does not really offer any missing insight into the operations of the calculator. It does not allow you to better predict its behaviour.
[8] Insomuch as one accepts that design is a kind of optimisation process. And I would insist that you should, but I’ve not gotten around to writing up my thoughts on what exactly qualifies as an optimisation process in a form that I would endorse. Eliezer’s “Measuring Optimisation Power” is a fine enough first approximation.
[9] The quickest gloss is that “agent” should be replaced with “artifact” (a general term for any object that is the product of an optimisation process). Some sample artifacts and the optimisation processes that produced them:
- The human brain: natural selection
- Trained ML models: stochastic gradient descent
- 1.41421356237: Newton’s method (approximation for the square root of 2)
- The quicksort algorithm: human design
[10] Among other things, a type should specify a set of properties that all members of the type share. If those properties are necessary and sufficient for an artifact to belong to a particular type, the type could simply be identified with its collection of properties.
Types can exist at different levels of abstraction (allowing them to specify artifact properties at different levels of detail).
An artifact can belong to multiple types (e.g. I might belong to the types: “human”, “male”, “Nigerian”).
[11] Rather than identifying the optimisation process in detail, only the objective function of the optimisation process is considered. Any other particulars/specifics of the optimisation process are abstracted away (the same way implementation details are abstracted away in asymptotic analysis).
The motivation is that I think that any two optimisation processes with the same objective functions on the same configuration space with the same “optimisation power” are identical for our purposes. And for convergence theorems, even the optimisation power is abstracted away.