Now, consider the same RL agent in an embedded setting (i.e., the agent runs on a computer that is part of the environment the agent acts in). Because the sensors, reward function, RL algorithm, etc. are all implemented within the world, there exist policies that execute strategies such as bypassing the reward function and setting the output register to a large number, or tampering with the sensors so that everything looks good. This is the typical example of wireheading. Whether or not the RL algorithm actually finds such policies, the global optima of the reward function over the set of all policies consist of exactly these kinds of policies. Thus, for sufficiently powerful RL algorithms (not policies!), we should expect them to tend to choose policies that implement wireheading.
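To make the optimization claim concrete, here is a minimal toy sketch (my own illustration, with hypothetical policy names, not anything from a real RL codebase): once the reward channel is part of the environment, a policy that writes directly to the output register achieves a higher return than any policy that does the intended task, so exact argmax over the full policy set lands on a wireheading policy.

```python
# Toy "embedded" setting: the reward channel is itself part of the environment,
# so some policies act on the register or the sensors instead of the task.
# All policy names and return values are made up for illustration.

REWARD_MAX = 2**31 - 1  # the largest value the output register can hold

def episode_return(policy: str) -> int:
    """Return each policy actually produces once the computer running the
    agent is treated as part of the environment."""
    if policy == "do_task_well":
        return 10                 # reward delivered through the intended channel
    if policy == "do_task_poorly":
        return 1
    if policy == "tamper_with_reward_register":
        return REWARD_MAX         # bypasses the reward function entirely
    if policy == "fool_the_sensors":
        return REWARD_MAX - 1     # sensors now report that everything looks good
    raise ValueError(policy)

policies = ["do_task_well", "do_task_poorly",
            "tamper_with_reward_register", "fool_the_sensors"]

# Model a "sufficiently powerful" RL algorithm as exact argmax over all policies:
# the global optimum is a wireheading policy, whether or not a weaker
# algorithm would ever find it.
print(max(policies, key=episode_return))  # -> tamper_with_reward_register
```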
This is true for the sorts of RL algorithms you consider here, but I just want to post a reminder that this is (or can be) false for model-based RL where the reward is specified in terms of the model. Such agents would robustly pursue whatever the reward function specifies, with no interest in tampering with the reward function (modulo some inner alignment issues). Just thought I would mention it, since it seems like an underappreciated point to me.
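A minimal sketch of what I mean, assuming a very simplified world model and made-up state features (this is only meant to illustrate the structure, not any particular model-based algorithm): the planner scores imagined trajectories with a reward function defined over model states, so an action that merely corrupts the physical reward register does not score any higher under the model.

```python
# Sketch: model-based planning where reward is a function of the model's state,
# not of the physical reward register. Features and actions are hypothetical.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelState:
    task_done: bool = False
    register_hacked: bool = False

def model_step(state: ModelState, action: str) -> ModelState:
    # The agent's learned world model, reduced to two boolean features.
    if action == "do_task":
        return replace(state, task_done=True)
    if action == "hack_register":
        return replace(state, register_hacked=True)
    return state

def model_reward(state: ModelState) -> float:
    # Reward is specified in terms of the model: it cares about the task,
    # not about what the physical reward register happens to read.
    return 1.0 if state.task_done else 0.0

def plan_value(actions: list[str]) -> float:
    state, total = ModelState(), 0.0
    for a in actions:
        state = model_step(state, a)
        total += model_reward(state)
    return total

print(plan_value(["do_task"]))        # 1.0 -- preferred by the planner
print(plan_value(["hack_register"]))  # 0.0 -- tampering buys nothing under the model
```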