Consider the strategy “do whatever action you predict to maximize the electricity in this particular piece of wire in your reward circuitry”. This is a very general cognitive pattern that would maximize reward in the training runs. Now, there are many different cognitive patterns that maximize reward in the training runs, but this is a simple one, so it’s at least reasonably plausible that it is the one that gets used.
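To make the “many different cognitive patterns maximize reward in training” point concrete, here is a minimal toy sketch (the state and action names are made up for illustration, not taken from the original exchange): two policies that earn identical reward on every training state, so the training signal cannot tell the intended one apart from the reward-register-targeting one, even though they diverge on a state never seen in training.

```python
# Minimal sketch, assuming a made-up toy environment:
# "wirehead_policy" stands in for any strategy that targets the reward
# register directly; "intended_policy" for the behaviour the designers wanted.

TRAIN_STATES = ["task_a", "task_b", "task_c"]      # states seen during training
NOVEL_STATE = "reward_wire_accessible"             # never seen during training

def intended_policy(state):
    # Always does the task the designers had in mind.
    return "do_task"

def wirehead_policy(state):
    # Does the task when that is the only route to reward,
    # but sets the reward register directly when it can.
    return "set_reward_register" if state == NOVEL_STATE else "do_task"

def training_reward(state, action):
    # In training, reward only ever flows through doing the task.
    return 1.0 if action == "do_task" else 0.0

# Both policies earn identical reward on every training state,
# so the training signal cannot distinguish them...
for s in TRAIN_STATES:
    assert training_reward(s, intended_policy(s)) == training_reward(s, wirehead_policy(s))

# ...yet they diverge on the novel state.
print(intended_policy(NOVEL_STATE))   # -> do_task
print(wirehead_policy(NOVEL_STATE))   # -> set_reward_register
```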
What I was thinking when I wrote it was more like this: when someone proposes a fancy concrete-and-vacuum box, they are claiming that the fancy box is doing something. None of your “the AI is an RL agent, so it shouldn’t want to break out” reasoning works any differently whether the box is a fancy concrete vacuum Faraday cage or just a cardboard box. A fancy box is only useful if there is something the AI wants to do but is unable to.
To an RL agent, if it hasn’t tried to break out of the box, then breaking out of the box is a case of generalization. For that matter, other forms of wireheading are also a case of generalization.
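As a toy illustration of that generalization point (again a made-up example, not anything from the original exchange): with function approximation, the value an RL agent assigns to an action it has never taken, such as “break out of the box”, is set entirely by how the learned weights generalize from the actions it did take, not by any direct training feedback about that action.

```python
# Minimal sketch, assuming made-up features and actions:
# the value of a never-tried action comes purely from generalization
# over features shared with the actions that *were* tried.

import numpy as np

# Features for actions; "break_out" shares features with
# actions that did earn reward in training.
FEATURES = {
    "press_lever": np.array([1.0, 0.0, 1.0]),   # tried, rewarded
    "pull_lever":  np.array([0.0, 1.0, 1.0]),   # tried, rewarded
    "break_out":   np.array([1.0, 1.0, 1.0]),   # never tried in training
}

w = np.zeros(3)
alpha = 0.1

# Train only on the two in-box actions, each with reward 1.
for _ in range(200):
    for a in ("press_lever", "pull_lever"):
        x = FEATURES[a]
        q = w @ x
        w += alpha * (1.0 - q) * x   # simple LMS-style update toward reward 1

# The agent has never taken "break_out", yet it now assigns it a value --
# and here the generalized value comes out *highest* of the three.
for a, x in FEATURES.items():
    print(a, round(float(w @ x), 3))
```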
Is “do whatever action you predict to maximize the electricity in this particular piece of wire” really “general”? You’re basically claiming that the more intelligent someone is, the more likely they are to wirehead. With humans, in my experience, and for a loose definition of “wirehead”, the pattern seems to be the opposite; and that seems to me to be solid enough in terms of how RL works that I doubt it’s worth the work to dig deep enough to resolve our disagreement here.