Why does the AI even “want” failure mode 3? If it’s an RL agent, it’s not “motivated to maximize its reward”, it’s “motivated to use generalized cognitive patterns that in its training runs would have marginally maximized its reward”. Failure mode 3 is the peak of an entirely separate mountain from the one RL is climbing, and I think a well-designed box setup can (more-or-less “provably”) prevent any cross-peak bridges in the form of cognitive strategies that undermine this.
That is to say: yes, it can (or at least, it’s not provable that it can’t) imagine a way to break the box, and it can know that the reward it would actually get from breaking the box would be “infinite”, but it can be successfully prevented from “feeling” the infinite-ness of that potential reward, because the RL procedure itself doesn’t consider a broken-box outcome to be a valid target of cognitive optimization.
Now, this creates a new failure mode, where it hacks its own RL optimizer. But that just makes it unfit, not dangerous. Insofar as something goes wrong to let this happen, it would be obvious and easy to deal with, because it would be optimizing for thinking it would succeed and not for succeeding.
(Of course, that last sentence could also fail. But at least that would require two simultaneous failures to become dangerous; and it seems in principle possible to create sufficient safeguards and warning lights around each of those separately, because the AI itself isn’t subverting those safeguards unless they’ve already failed.)
Consider the strategy “do whatever action you predict to maximize the electricity in this particular piece of wire in your reward circuitry”. This is a very general cognitive pattern that would maximize reward in the training runs. Now, there are many different cognitive patterns that maximize reward in the training runs. But this is a simple one, so it’s at least reasonably plausible that it is used.
What I was thinking when I wrote it was more like this: when someone proposes a fancy concrete and vacuum box, they are claiming that the fancy box is doing something. Your “the AI is an RL agent, so it shouldn’t want to break out” argument works no differently whether the box is a fancy concrete vacuum Faraday cage or just a cardboard box. A fancy box is only useful if there is something the AI wants to do but is unable to do.
To an RL agent, if it hasn’t tried to break out of the box, then breaking out of the box is a case of generalization. For that matter, other forms of wireheading are also a case of generalization.
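To make the generalization point concrete, here is a minimal toy sketch, not a model of any real system. It assumes a tabular epsilon-greedy learner, and the action names, reward numbers, and the function `train_tabular_bandit` are all made up for illustration. The learner only updates value estimates for actions it actually takes in training, so its estimate for the never-tried `break_box` action stays at whatever prior it started with, however large the true reward would be.

```python
import random

def train_tabular_bandit(rewards, tried_actions, episodes=2000,
                         lr=0.1, epsilon=0.1, seed=0):
    """Epsilon-greedy tabular learner that only ever samples `tried_actions`.

    `rewards` maps every action to its true reward, but the learner only
    observes the reward of actions it actually takes during training.
    """
    rng = random.Random(seed)
    q = {a: 0.0 for a in rewards}  # every estimate starts at the same prior
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(tried_actions)                # explore
        else:
            action = max(tried_actions, key=lambda a: q[a])   # exploit
        # Incremental update toward the observed reward of the taken action.
        q[action] += lr * (rewards[action] - q[action])
    return q

# Hypothetical toy environment: breaking the box pays hugely, but is never tried.
rewards = {"solve_task": 1.0, "idle": 0.0, "break_box": 1000.0}
q = train_tabular_bandit(rewards, tried_actions=["solve_task", "idle"])
best = max(q, key=q.get)  # action the learned table prefers
```

In this tabular setting nothing learned about `solve_task` transfers to `break_box`, so the table never comes to prefer it. A learner with function approximation or an internal world model could assign the untried action a value by generalizing from what it has seen, and whether it does is exactly the point in dispute here.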
Is “do whatever action you predict to maximize the electricity in this particular piece of wire” really “general”? You’re basically claiming that the more intelligent someone is, the more likely they are to wirehead. With humans, in my experience, and for a loose definition of “wirehead”, the pattern seems to be the opposite; and that seems to me to be solid enough in terms of how RL works that I doubt it’s worth the work to dig deep enough to resolve our disagreement here.