This paper describes a kind of “partially embedded” agent: one that explicitly models its reward/utility function (but not other parts of itself) as belonging to the environment and subject to modification, whether by itself or by the environment. It shows that an agent which uses its current utility function to decide what to do has no incentive to modify that utility function, and that if it properly models what would happen were the utility function changed by the environment, it will also want to protect it. The paper seems to spend a lot of pages on a relatively simple/intuitive idea that has been discussed on LW in various forms for at least a decade, but maybe this kind of detailed formal treatment will be useful for making some people (ML researchers?) take AI safety ideas more seriously?
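To make that point concrete, here is a minimal toy sketch (mine, not the paper’s formalism; the world, numbers, and names are all illustrative assumptions): a deterministic three-step setting where a “tamper” action overwrites the reward function with one that returns 100 everywhere, evaluated two ways.

```python
# Toy sketch (not from the paper): why evaluating futures with the *current*
# reward function removes the incentive to tamper with that function.
# All states, numbers, and names below are illustrative assumptions.

R_CURRENT = {"work": 1.0, "tamper": 0.0}     # the agent's current reward function
R_HACKED = {"work": 100.0, "tamper": 100.0}  # reward function after tampering


def rollout(action, horizon=3):
    """Deterministic rollout: the chosen action repeats, and tampering
    swaps in the hacked reward function for every subsequent step."""
    active_rf = R_HACKED if action == "tamper" else R_CURRENT
    return [(action, active_rf) for _ in range(horizon)]


def value_naive(action):
    """Score each step with whatever reward function is active at that step."""
    return sum(rf[state] for state, rf in rollout(action))


def value_current_rf(action):
    """Score every step with the agent's *current* reward function."""
    return sum(R_CURRENT[state] for state, _ in rollout(action))


for evaluate in (value_naive, value_current_rf):
    best = max(["work", "tamper"], key=evaluate)
    print(f"{evaluate.__name__}: work={evaluate('work'):.0f}, "
          f"tamper={evaluate('tamper'):.0f} -> chooses {best!r}")
```

The naive evaluation, which scores each future step with whatever reward function will be active at that step, prefers tampering; the evaluation that sticks with the current reward function sees nothing to gain from tampering, which is the stability result described above.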
I can’t resist giving this pair of rather incongruous quotes from the paper:
Fortunately and perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.
A known reward function also brings RL closer to the frameworks of decision theory and game theory (Osborne, 2003; Steele and Stefansson, 2016), where agents are usually aware of their own reward (or utility) function, and are only uncertain about the outcomes for different actions or policies.
Could you spell out what makes the quotes incongruous with each other? It’s not jumping out at me.
The authors acknowledged that their modification of RL “brings RL closer to the frameworks of decision theory and game theory” (AFAICT, the algorithms they end up with are nearly pure decision/game theory). But given that some researchers have focused on decision theory for a long time exactly because a decision-theoretic agent can be reflectively stable, it seems incongruous to also write “perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.”
We didn’t expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.
Ah, that makes sense. I kind of guessed that the target audience was RL researchers, but still misinterpreted “perhaps surprisingly” as a claim of novelty rather than as an attempt to raise that audience’s interest.