I know of inverse reinforcement learning and similar ideas, but I still argue that they are bad for the same reason.
In regular reinforcement learning, the human presses a button that says “GOOD”, and a sufficiently intelligent AI learns that it can just steal the button and press it itself.
In inverse reinforcement learning, the human presses a button that says “GOOD” at first. Then the button is turned off, and the AI is told to predict what actions would have led to the button being pressed. Instead of actual reinforcement, there is merely predicted reinforcement.
However, a sufficiently intelligent AI should predict that stealing the button would have resulted in the button being pressed, and so it will still do that. Even though the button is turned off, the AI is trying to predict what would be best in the counterfactual world where the button is still on.
And so the programmer thinks that they have taught the AI to understand what is good, but really they have just taught it to figure out how to press a button labelled “GOOD”.
This is not how IRL works at all. The utility function does not come from a special reward channel controlled by a human. There is no button.
To reiterate my description earlier, IRL is based on inferring the unknown utility function of an agent given examples of the agent’s behaviour in terms of observations and actions. The utility function is entirely an internal component of the model.
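To make that concrete, here is a minimal sketch of the kind of inference IRL performs, under assumptions of my own: a toy four-state MDP, hypothetical demonstrations, and two made-up candidate reward functions (none of these names or values come from the discussion above, and this is not any particular paper's algorithm). Each candidate reward is scored by how well a softmax-rational agent holding that reward would explain the demonstrated behaviour; the best-scoring reward is the inferred utility function. Note that the reward lives entirely inside the model as a parameter being estimated; nowhere is there an external reward channel for the AI to seize.

```python
# A toy illustration of IRL-style inference (hypothetical values throughout):
# infer which candidate reward function best explains observed demonstrations.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9

# Deterministic transition model P[a, s, s']:
# action 0 moves "left", action 1 moves "right" along a line of 4 states.
P = np.zeros((n_actions, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n_states - 1)] = 1.0

def soft_q_values(reward, n_iters=200):
    """Soft value iteration: Q(s, a) for a softmax-rational agent."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = np.log(np.exp(Q).sum(axis=1))                 # soft state values
        Q = reward[:, None] + gamma * (P.transpose(1, 0, 2) @ V)
    return Q

def demo_log_likelihood(reward, demos):
    """Log-probability of the demonstrated (state, action) pairs
    under the softmax policy induced by this candidate reward."""
    Q = soft_q_values(reward)
    log_pi = Q - np.log(np.exp(Q).sum(axis=1, keepdims=True))
    return sum(log_pi[s, a] for s, a in demos)

# Hypothetical demonstrations: the observed agent always moves right.
demos = [(0, 1), (1, 1), (2, 1), (3, 1)]

# Candidate utility functions over states -- purely internal model parameters,
# not a reward channel the learner could intercept.
candidates = {
    "prefers state 0": np.array([1.0, 0.0, 0.0, 0.0]),
    "prefers state 3": np.array([0.0, 0.0, 0.0, 1.0]),
}
for name, r in candidates.items():
    print(name, demo_log_likelihood(r, demos))
# "prefers state 3" scores higher, so it is the inferred utility function.
```

In this sketch the only inputs are the observed behaviour and a model of the environment; "pressing a button" never enters the picture, because the quantity being learned is a description of what the demonstrator values, not a signal the learner receives.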