I think that’s a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:
Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior
I agree with the literal content of this sentence, but I personally don’t imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).
Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.
Agents selected by ML (e.g. RL training on games) also often have internal state.
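As a concrete illustration of that point (a hypothetical minimal sketch, not something from the discussion above; all names here are made up), an agent with internal state can take different actions on identical observations, because its behavior depends on accumulated history:

```python
# Minimal sketch of an agent with internal state: its action depends on an
# internal accumulator updated from past observations, not just on the
# current observation. (Toy illustration only.)

class StatefulAgent:
    def __init__(self):
        self.hidden = 0.0  # internal state that persists across timesteps

    def act(self, observation: float) -> int:
        # Update internal state from the observation (a toy "recurrence").
        self.hidden = 0.5 * self.hidden + observation
        # The action depends on the accumulated state, not just `observation`.
        return 1 if self.hidden > 1.0 else 0

agent = StatefulAgent()
# Same observation twice, but different actions, because the internal
# state differs between the two calls:
first = agent.act(0.8)   # hidden = 0.8  -> action 0
second = agent.act(0.8)  # hidden = 1.2  -> action 1
print(first, second)     # prints: 0 1
```

A coherence argument that models the agent purely as a mapping from current observation to action would miss this history-dependence, which is why the no-internal-state assumption matters.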
Edited to

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values
and
[...] the resulting agents can be represented as maximizing expected utility, if the agents don’t have internal state.
(For the second one, that’s one of the reasons why I had the weasel word “could”, but on reflection it’s worth calling out explicitly given I mention it in the previous sentence.)
Cool, looks good.