Thanks for this and for the response to my other comment; I understand where you’re coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:
This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).
As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.
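To make the Dutch book notion concrete, here is a minimal sketch of a money pump. It is an illustration under stated assumptions, not something taken from the post: the `Agent` class, the items `A`/`B`/`C`, the cyclic preference relation, and the per-trade `fee` are all hypothetical. The agent pays a small fee whenever it is offered something it strictly prefers to what it holds, so an agent with cyclic preferences can be led around the cycle, losing money each lap.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    # Strict preferences as (preferred, dispreferred) pairs (hypothetical setup).
    prefers: frozenset
    holding: str
    money: float

    def accepts_trade(self, offered: str) -> bool:
        # The agent swaps only if it strictly prefers the offered item.
        return (offered, self.holding) in self.prefers


def run_money_pump(agent: Agent, offers: list, fee: float) -> Agent:
    """Offer items in turn; the agent pays `fee` for every swap it accepts."""
    for item in offers:
        if agent.accepts_trade(item):
            agent.holding = item
            agent.money -= fee
    return agent


# Cyclic preferences: A > B, B > C, C > A.
cyclic = Agent(
    prefers=frozenset({("A", "B"), ("B", "C"), ("C", "A")}),
    holding="A",
    money=10.0,
)

# Transitive preferences: A > B > C (and therefore A > C).
transitive = Agent(
    prefers=frozenset({("A", "B"), ("B", "C"), ("A", "C")}),
    holding="A",
    money=10.0,
)

offers = ["C", "B", "A"] * 2  # two laps around the cycle
run_money_pump(cyclic, offers, fee=1.0)
run_money_pump(transitive, offers, fee=1.0)

print(cyclic.holding, cyclic.money)
print(transitive.holding, transitive.money)
```

The cyclic agent ends up holding the item it started with but with less money, while the transitive agent declines every trade. This only illustrates the "guaranteed to lose resources" criterion; it does not by itself show the expected-utility representation, which needs the further assumptions the coherence arguments make.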
The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.
New opinion:
People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough; you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.
I think that’s a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:

> Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior

I agree with the literal content of this sentence, but I personally don’t imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).

> Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.

Agents selected by ML (e.g. RL training on games) also often have internal state.
Edited to

> Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values

and

> [...] the resulting agents can be represented as maximizing expected utility, if the agents don’t have internal state.

(For the second one, that’s one of the reasons why I had the weasel word “could”, but on reflection it’s worth calling out explicitly given I mention it in the previous sentence.)
Cool, looks good.