Wouldn’t this corruption or manipulation render the AGI incorrigible? I think not, because I don’t think corruption or manipulation are natural categories. For example, I think it’s very common for humans to unknowingly influence other humans in subtle ways while honestly believing they’re only trying to be helpful, while an onlooker might describe the same behavior as manipulative. (Section IV here provides an amusing illustration.) Likewise, I think an AGI can be manipulating us while genuinely thinking it’s helping us and being completely open with us (much like a messiah), unaware that its actions would lead us somewhere we wouldn’t currently endorse.
What do you mean by manipulation?
If the AI is optimizing its behavior to have some effect on the human, then that’s practically the central case the concept of corrigibility is intended to exclude. I don’t think it matters what the AI thinks about what the AI is doing, it just matters what optimization power it is applying.
If the AI isn’t optimizing to influence our behavior, then I’m back to not understanding the problem. Can you flesh out this step of the argument? Is the problem that helping humans get what they short-term-want will lead to trouble? Is it something else?
See my comment re: optimizing for high approval now vs. high approval later.
If you buy (as I do) that optimizing for high approval now leaves a huge number of important variables unconstrained, I don’t see how it makes sense to talk about an AI optimizing for high approval now without also optimizing to have some effect on the human, because the unconstrained variables are about effects on the human. If there were a human infant in the wilderness and you told me to optimize for keeping it alive without optimizing for any other effects on the infant, and you told me I’d be screwing up if I did optimize for other effects, I would be extremely confused about how to proceed.
If you don’t buy that optimizing for high approval now leaves a huge number of important variables unconstrained, then I agree that the AI optimizing its behavior to have some effects on the human should be ruled out by the definition of corrigibility.
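One way to make the “unconstrained variables” worry concrete (a toy illustration with made-up actions and numbers, not anything proposed in this thread): several candidate actions can tie exactly on the short-horizon objective while differing in their longer-run effects on the human, and the objective by itself says nothing about how to choose among them.

```python
# Toy sketch (hypothetical actions and scores): an agent that scores
# candidate actions only by expected approval *now*. Several actions tie
# on that score but differ in an unmodeled longer-run effect on the human.

candidates = {
    # action: (expected_approval_now, longer_run_effect_on_human)
    "answer plainly":           (0.90, "none"),
    "answer + subtle flattery": (0.90, "user grows more deferential"),
    "answer + framing nudge":   (0.90, "user's views drift toward the AI's"),
}

best = max(score for score, _ in candidates.values())
maximizers = [a for a, (score, _) in candidates.items() if score == best]

print(maximizers)
# All three actions maximize approval-now; the objective leaves the choice
# among them, and hence the effect on the human, unconstrained.
```

Whether a realistic approval signal actually degenerates like this is exactly what seems to be in dispute below.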
Saying “choose the action a for which your expectation of f(a) is highest” doesn’t leave you any degrees of freedom. Similarly, “choose the action for which the child’s probability of survival is highest” seems pretty unambiguous (modulo definitions of counterfactuals).
I might be misunderstanding what you are saying somehow.
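For concreteness (the notation here is my paraphrase, not anything from the comments), the two rules being appealed to are roughly:

```latex
a^\star \;=\; \arg\max_{a}\; \mathbb{E}\!\left[f(a)\right]
\qquad \text{and} \qquad
a^\star \;=\; \arg\max_{a}\; \Pr\!\left(\text{child survives} \mid a\right)
```

As written, neither objective mentions other effects on the human at all; the disagreement in this thread is over whether that omission leaves real slack in what the maximizer ends up doing.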
Not sure if this is what zhukeepa means, but “choose the action for which the child’s probability of survival is highest” is very likely going to involve actions that could be interpreted as “manipulation” unless the AI deliberately places a constraint on its optimization against doing such things.
But since there is no objective standard for what is manipulation vs education or helpful information, Overseers will need to apply their subjective views (or their understanding of the user’s views) of what counts as manipulation and what doesn’t. If they get this wrong (or simply forget or fail to place the appropriate constraint on the optimization), then the user will end up being manipulated even though the AI could be considered to be genuinely trying to be helpful.
EDIT: As a matter of terminology, I might actually prefer to call this scenario a failure of corrigibility rather than “corrigible but misaligned”. I wonder if zhukeepa has any reasons to want to call it the latter.
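A minimal sketch of the failure mode being described (the predicate, action names, and numbers are illustrative assumptions, not a proposal from this thread): the Overseer’s subjective judgment about what counts as manipulation enters as a filter on the action set, so an error or omission in that judgment passes straight through to the chosen action.

```python
# Toy sketch: the Overseer supplies a subjective "counts as manipulation"
# predicate as a constraint on the optimization. If the predicate is too
# narrow (or forgotten), the optimizer selects a manipulative action anyway.

def overseer_counts_as_manipulation(action: str) -> bool:
    # Subjective judgment; being wrong here is exactly the failure at issue.
    return "mind-control" in action  # misses subtler cases

def choose(actions: dict[str, float]) -> str:
    # Maximize expected survival among the actions the filter allows.
    allowed = {a: v for a, v in actions.items()
               if not overseer_counts_as_manipulation(a)}
    return max(allowed, key=allowed.get)

actions = {
    "give survival advice":       0.80,  # expected survival probability (made up)
    "convert user to a religion": 0.85,  # slips past the filter above
    "mind-control the user":      0.95,  # caught by the filter
}

print(choose(actions))  # -> "convert user to a religion"
```

The religious-conversion case is spelled out further down the thread; the point of the sketch is only that the constraint is exactly as good as the Overseer’s subjective standard.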
Totally agree that “choose the action for which the child’s probability of survival is highest” involves manipulation (though no one was proposing that). I’m confused about the meaning of “unconstrained variables” though.
I could be wrong, but I feel like if I ask for education or manipulation and the AI gives it to me, and bad stuff happens, that’s not a problem with the redirectability or corrigibility of the agent. After all, it just did what it was told. Conversely, if the AI system refuses to educate me, that seems rather more like a corrigibility problem. A natural divider is that with a corrigible AI we can still inflict harm on ourselves via our use of that AI as a tool.
I think there must be a miscommunication somewhere because I don’t see how your point is a response to mine. My scenario isn’t “I ask for education or manipulation and the AI gives it to me, and bad stuff happens”, but something like this: I ask my AI to help me survive, and the AI (among other things) converts me to some religion because it thinks belonging to a church will give me a support group and help maximize my chances, and the Overseer thinks religious education is just education rather than manipulation, or mistakenly thinks I think that, or made some other mistake that failed to prevent this.
I see. What I was trying to do was answer your terminology question by addressing simple extreme cases. E.g. if you ask an AI to disconnect its shutdown button, I don’t think it’s being incorrigible. If you ask an AI to keep you safe, and then it disconnects its shutdown button, it is being incorrigible.
I think the main way the religion case differs is that the AI system is interfering with our intellectual ability for strategizing about AI rather than our physical systems for redirecting AI, and I’m not sure how that counts. But if I ask an AI to keep me safe and it mind-controls me to want to propagate that AI, that’s sure incorrigible. Maybe, as you suggest, it’s just fundamentally ill-defined...