Hypothesis: there’s a way of formalizing the notion of “empowerment” such that an AI with the goal of empowering humans would be corrigible.
This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn’t ever let the humans spend that power. Intuitively, though, there’s a sense in which a human who can never spend their power doesn’t actually have any power. Is there a way of formalizing that intuition?
The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl’s do-calculus). Define the power of a human with respect to a distribution of goals G as the human’s average ability to achieve their goal if they’d had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance letting humans pursue their goals against preventing humans from taking self-destructive actions. The exact balance would depend on G, but one could hope that it’s not very sensitive to the precise definition of G (especially if the AI isn’t actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).
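As a rough illustration of the definition above, here is a minimal toy sketch in Python. Everything specific in it is an assumption made up for the example and not taken from the post: the three goals in G, their weights, the success probabilities, and the two hypothetical policies `hoard_policy` and `permissive_policy`. It only computes power as the G-weighted average chance of achieving a goal the human might have had.

```python
# Toy model: power = expected success probability over goals sampled from G.
# All goals, weights, and probabilities are hypothetical illustration values.

# Goal distribution G: goal name -> probability of the human having that goal.
G = {"build_house": 0.5, "travel": 0.3, "gamble_everything": 0.2}

# Assumed chance of achieving each goal if the human is free to pursue it.
UNAIDED_SUCCESS = {"build_house": 0.8, "travel": 0.9, "gamble_everything": 0.1}


def hoard_policy(goal):
    """AI never lets the human spend resources, so no goal is ever achieved."""
    return 0.0


def permissive_policy(goal):
    """AI lets the human pursue whatever goal they turn out to have."""
    return UNAIDED_SUCCESS[goal]


def counterfactual_power(policy, goal_dist):
    """Average ability to achieve a goal, had the goal been sampled from goal_dist."""
    return sum(prob * policy(goal) for goal, prob in goal_dist.items())


hoard_power = counterfactual_power(hoard_policy, G)
permissive_power = counterfactual_power(permissive_policy, G)
print(f"hoarding AI:   {hoard_power:.2f}")
print(f"permissive AI: {permissive_power:.2f}")
```

Under these made-up numbers the hoarding policy scores zero, matching the claim that never letting humans spend resources yields low power. The sketch is single-step, so it does not capture the trade-off with self-destructive actions, which would need a model where acting on one goal changes the ability to achieve later ones.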
The problem here is that these counterfactuals aren’t very clearly defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question “what would the AI be doing in this world?” has no sensible answer (or maybe the answer would be “it would realize it’s in a weird hypothetical world and behave accordingly”). Similarly, if we model this using the do-operation, the best policy is something like “wait until the human’s goals suddenly and inexplicably change, then optimize hard for their new goal”.
Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl’s do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
There’s also the problem of what you mean by “the human”. If you make an empowerment calculus that works for humans who are atomic, ideal agents, it probably breaks once you get a superintelligence that can likely mind-hack you into valuing only power yourself. It never forces you to abstain from giving up power, since you’re perfectly capable of making different decisions; you just don’t.
Another problem, which I like to think of as the “control panel of the universe” problem, is where the AI gives you the “control panel of the universe”, but you aren’t smart enough to operate it, in the sense that you have the information necessary to operate it, but not the intelligence. The result is that you can technically do anything you want (you have maximal power/empowerment), but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
> The result is that you can technically do anything you want (you have maximal power/empowerment), but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
I think any model of a rational agent needs to incorporate the fact that they’re not arbitrarily intelligent, otherwise none of their actions make sense. So I’m not too worried about this.
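The gap between the “control panel” worry and a bounded-agent model can be sketched with a toy example. The four-button panel, the sequence length, the ten goals, and the search budget below are all hypothetical choices for illustration, as is the assumption that a bounded human tries a fixed number of distinct sequences uniformly at random.

```python
from itertools import product

# Hypothetical control panel: every length-5 sequence over 4 buttons.
BUTTONS = "ABCD"
SEQ_LEN = 5
ALL_SEQUENCES = ["".join(s) for s in product(BUTTONS, repeat=SEQ_LEN)]

# Assumption: each goal is achieved by exactly one sequence; every other
# sequence just makes paperclips.
GOAL_TO_SEQUENCE = {f"goal_{i}": seq for i, seq in enumerate(ALL_SEQUENCES[:10])}


def nominal_power(goal_to_sequence, sequences):
    """Fraction of goals for which some available sequence achieves the goal."""
    available = set(sequences)
    reachable = sum(1 for seq in goal_to_sequence.values() if seq in available)
    return reachable / len(goal_to_sequence)


def usable_power(goal_to_sequence, sequences, budget):
    """Average chance of hitting the right sequence when the human can only
    try `budget` distinct sequences, picked uniformly at random."""
    per_goal = min(budget, len(sequences)) / len(sequences)
    return sum(per_goal for _ in goal_to_sequence) / len(goal_to_sequence)


nominal = nominal_power(GOAL_TO_SEQUENCE, ALL_SEQUENCES)
usable = usable_power(GOAL_TO_SEQUENCE, ALL_SEQUENCES, budget=8)
print(f"nominal power: {nominal}")
print(f"usable power with a budget of 8 tries: {usable}")
```

In this sketch the nominal measure says the human can achieve every goal, while the bounded measure says they almost never do. A definition of power that scores the human by the second number would not count handing over the panel as empowering.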
> If you make an empowerment calculus that works for humans who are atomic, ideal agents, it probably breaks once you get a superintelligence that can likely mind-hack you into valuing only power yourself.
Yeah, I agree that a lot of concepts get fragile in the context of superintelligence. But while I think of corrigibility as an actively anti-natural concept, empowerment seems like it could perhaps remain robust and well-founded for longer.
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don’t know how to actually pin down these hypotheticals.