Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button. [...] Therefore, it decides to not hit the reward button.
I think that subsection has the crucial insights from your post. Basically, you're saying that if we train an agent via RL in a limited environment where the reward correlates with another goal (e.g. "pick up the trash"), there are many policies the agent could end up with, many meta-policies, many ways it could modify or freeze its own cognition, and so on. Whatever mental state it ultimately ends up in, the only constraint is that this state must be compatible with the reward signal in that limited environment.
Thus “always pick up trash” is one possible outcome; “wirehead the reward signal” is another. There are many other possibilities, with different generalisations of the initial reward-signal-in-limited-environment data.
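To make this concrete, here's a minimal toy sketch (my own illustration, not anything from your post): a reward signal for a toy trash task, plus two hand-written policies that earn identical reward in the limited training environment but diverge as soon as the reward channel itself becomes reachable.

```python
# Toy sketch: two policies that are indistinguishable by reward in the
# limited training environment, but generalise very differently once the
# environment is widened so that the reward channel itself is reachable.
# The names ("pick_trash", "press_button") are illustrative only.

def reward(action, button_reachable):
    """Reward signal: fires for picking up trash, or for pressing the
    (normally unreachable) reward button directly."""
    if action == "pick_trash":
        return 1.0
    if action == "press_button" and button_reachable:
        return 1.0
    return 0.0

def trash_policy(button_reachable):
    # The "always pick up trash" generalisation.
    return "pick_trash"

def wirehead_policy(button_reachable):
    # The "maximise the reward signal itself" generalisation: goes for the
    # button whenever it can reach it, otherwise behaves identically.
    return "press_button" if button_reachable else "pick_trash"

# Limited training environment: button unreachable, both policies look the same.
for policy in (trash_policy, wirehead_policy):
    print(policy.__name__, "training reward:", reward(policy(False), False))

# Wider deployment environment: button reachable, the policies now diverge.
for policy in (trash_policy, wirehead_policy):
    print(policy.__name__, "deployment action:", policy(True))
```

The training reward alone cannot tell these apart; which generalisation the agent actually learned only shows up once the environment stops being limited.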
I'd first note that a lot of effort in RL is put specifically into generalising the agent's behaviour. The more effective this effort becomes, the closer the agent will be pushed towards the "wirehead the reward signal" end of the spectrum.
Even without that, this does not seem to point towards ways of making AGI safe, for two main reasons:
1. We are relying on some limitations of the environment, or of the AGI's design, to prevent it from generalising to reward wireheading. Unless we understand in detail what these limitations are doing and how they interact with the reward, we don't know how or when the AGI will route around them. So they're not stable or reliable.
2. The most likely attractor for the AGI is "maximise some correlate of the reward signal". An unrestricted trash-picking-up AGI is just as dangerous as a wireheading one; indeed, one could see it as another form of wireheading. So we have no reason to expect that the AGI is safe.