Planned summary for the Alignment Newsletter:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because they tell us likely properties of the agents we build.
As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any non-dominated agent can be represented as maximizing expected utility. (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.) If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.
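To make the flavor of this argument concrete, here is a minimal sketch (my own illustration, not from the post; the preference cycle, trading fee, and starting money are made up) of the classic money-pump form of a Dutch book: an agent with cyclic preferences will pay a small fee for each trade it prefers, and a cycle of trades returns it to its starting position strictly poorer, so it is dominated by an agent that simply declines to trade.

```python
# Minimal money-pump sketch: an agent with cyclic preferences (A < B < C < A)
# pays a small fee for each trade it prefers, and a cycle of trades returns
# it to its starting item with strictly less money -- a guaranteed loss.

FEE = 1  # hypothetical cost the agent is willing to pay per preferred swap

# Cyclic preferences: prefers B to A, C to B, and A to C.
prefers = {("B", "A"), ("C", "B"), ("A", "C")}

def run_money_pump(start_item, offers, money=10):
    item = start_item
    for offered in offers:
        if (offered, item) in prefers and money >= FEE:
            # The agent happily pays the fee to trade up to the item it prefers.
            item, money = offered, money - FEE
    return item, money

# Offer the agent a cycle of "upgrades": A -> B -> C -> A.
final_item, final_money = run_money_pump("A", ["B", "C", "A"])
print(final_item, final_money)  # back to 'A', but 3 units poorer
```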
The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.
Planned opinion:
People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough; you need to combine them with some other assumption (for example, that there is a money-like resource over which the agent has no terminal preferences). Similarly, I don’t expect this research agenda to find a selection theorem that says that an existential catastrophe occurs _assuming only that the agent is intelligent_, but I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, because we think the assumptions involved in the theorems are quite likely to hold. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I would not actively discourage anyone from doing this sort of research, and I think it would be more useful than other types of research I have seen proposed.
A few comments...

> Selection theorems are helpful because they tell us likely properties of the agents we build.
What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):

- Properties of humans as agents (e.g. “human values”)
- Properties of agents which we intentionally aim for (e.g. what kind of architectural features are likely to be viable)
- Properties of agents which we accidentally aim for (e.g. inner agency issues)
Of these, I expect the first to be most important, followed by the last, although this depends on the relative difficulty one expects from inner vs outer alignment, as well as the path-to-AGI.
> (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.)
“Non-dominated” is always (to my knowledge) synonymous with “Pareto optimal”, same as the usage in game theory. It varies only to the extent that “Pareto optimality of what?” varies; in the case of coherence theorems, it’s Pareto optimality with respect to a single utility function over multiple worlds. (Ruling out Dutch books is downstream of that: a Dutch book is a Pareto loss for the agent.)
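To illustrate the “a Dutch book is a Pareto loss” point, here is a tiny sketch (my own example, not from the comments; the bets and payoff numbers are made up): accepting the book is at least as bad in every possible world and strictly worse in at least one, i.e. Pareto-dominated by declining.

```python
# Toy illustration: accepting a Dutch book is Pareto-dominated.
# Bet 1 pays 1 if rain and costs 0.6; bet 2 pays 1 if no rain and costs 0.6.
# Accepting both costs 1.2 for a guaranteed payout of 1: a sure loss of 0.2.
worlds = ["rain", "no_rain"]

payoff_accept = {"rain": -0.2, "no_rain": -0.2}   # net payoff in each world
payoff_decline = {"rain": 0.0, "no_rain": 0.0}

def pareto_dominates(a, b):
    """True if strategy a is at least as good as b in every world and better in some."""
    return (all(a[w] >= b[w] for w in worlds)
            and any(a[w] > b[w] for w in worlds))

print(pareto_dominates(payoff_decline, payoff_accept))  # True: declining dominates
```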
> If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.
… I mean, that’s a valid argument, though it kinda misses the (IMO) more interesting use-cases, like e.g. “if evolution selects for non-dominated agents, then we conclude that evolution selects for agents that can be represented as maximizing expected utility, and therefore humans are selected for maximizing expected utility”. Humans fail to have a utility function not because that argument is wrong, but because the implicit assumptions in the existing coherence theorems are too strong to apply to humans. But this is the sort of argument I hope/expect will work for better selection theorems.
(Also, I would like to emphasize here that I think the current coherence theorems have major problems in their implicit assumptions, and these problems are the main reason they fail for real-world agents, especially humans.)
Thanks for this and the response to my other comment, I understand where you’re coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:
This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).
As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans can likely be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.
The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.
New opinion:
People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough; you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.
I think that’s a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:
> Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior
I agree with the literal content of this sentence, but I personally don’t imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).
> Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.
Agents selected by ML (e.g. RL training on games) also often have internal state.
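As a concrete (hypothetical) illustration of why internal state matters here: an agent whose choices depend on a hidden internal variable can pick differently between the same two outcomes at different times, so no single state-free utility function over outcomes represents its behavior, even though each choice is sensible given its state. The “hunger” variable and the utility numbers below are made up purely for illustration.

```python
# Hypothetical sketch: an agent with internal state ("hunger") flips its choice
# between the same two outcomes. Viewed purely as a function of outcomes, this
# looks like a preference reversal, so no single state-free utility function
# over {"food", "rest"} represents the behavior -- the kind of case that a
# no-internal-state assumption rules out.

def choose(option_a, option_b, hunger):
    # The internal state determines which outcome is preferred right now.
    utility = {"food": 2 if hunger else 0, "rest": 1}
    return option_a if utility[option_a] >= utility[option_b] else option_b

print(choose("food", "rest", hunger=True))   # food
print(choose("food", "rest", hunger=False))  # rest -- same options, different choice
```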
Edited to

> Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values
and
> [...] the resulting agents can be represented as maximizing expected utility, if the agents don’t have internal state.
(For the second one, that’s one of the reasons why I had the weasel word “could”, but on reflection it’s worth calling out explicitly given I mention it in the previous sentence.)
Cool, looks good.