Saying “choose the action a for which your expectation of f(a) is highest,” doesn’t leave you any degrees of freedom. Similarly, “choose the action for which the child’s probability of survival is highest” seems pretty unambiguous (modulo definitions of counterfactuals).
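To spell that out slightly more formally, the rule I have in mind is roughly

$$a^* = \operatorname*{arg\,max}_a \; \mathbb{E}[f(a)],$$

which fixes the choice of action once $f$ and the expectation are fixed (up to ties).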
I might be misunderstanding what you are saying somehow.
Not sure if this is what zhukeepa means, but “choose the action for which the child’s probability of survival is highest” is very likely going to involve actions that could be interpreted as “manipulation” unless the AI deliberately places a constraint on its optimization against doing such things.
But since there is no objective standard for what is manipulation vs education or helpful information, Overseers will need to apply their subjective views (or their understanding of the user’s views) of what counts as manipulation and what doesn’t. If they get this wrong (or simply forget or fail to place the appropriate constraint on the optimization), then the user will end up being manipulated even though the AI could be considered to be genuinely trying to be helpful.
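To make the worry concrete, here is a minimal sketch (the function names and action list are purely illustrative, not anyone’s proposed design): a bare argmax over survival probability will happily pick a manipulative action, and the only thing stopping it is an overseer-supplied filter that can be wrong or simply forgotten.

```python
# Minimal illustrative sketch, not a proposed design. All names are hypothetical.

def best_action(actions, survival_prob):
    """Unconstrained: pick whatever maximizes expected survival."""
    return max(actions, key=survival_prob)

def best_permitted_action(actions, survival_prob, seems_manipulative):
    """Same objective, but drop actions the overseer judges manipulative.
    If that judgment is mistaken or the filter is omitted, the
    manipulative action wins again."""
    permitted = [a for a in actions if not seems_manipulative(a)]
    return max(permitted, key=survival_prob) if permitted else None

actions = ["share accurate safety information",
           "convert the user to a supportive church"]
survival_prob = {"share accurate safety information": 0.90,
                 "convert the user to a supportive church": 0.95}.get

print(best_action(actions, survival_prob))
# -> 'convert the user to a supportive church'
print(best_permitted_action(actions, survival_prob,
                            lambda a: "convert" in a))
# -> 'share accurate safety information'
```

The objective itself never penalizes manipulation; the filter is a separate, subjective judgment, which is exactly where the failure I’m describing can come in.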
EDIT: As a matter of terminology, I might actually prefer to call this scenario a failure of corrigibility rather than corrigible but misaligned. I wonder if zhukeepa has any reasons to want to call it the latter.
Totally agree that “choose the action for which the child’s probability of survival is highest” involves manipulation (though no one was proposing that). I’m confused about the meaning of “unconstrained variables” though.
I could be wrong, but I feel like if I ask for education or manipulation and the AI gives it to me, and bad stuff happens, that’s not a problem with the redirectability or corrigibility of the agent. After all, it just did what it was told. Conversely, if the AI system refuses to educate me, that seems rather more like a corrigibility problem. A natural dividing line is that with a corrigible AI we can still inflict harm on ourselves via our use of that AI as a tool.
I think there must be a miscommunication somewhere because I don’t see how your point is a response to mine. My scenario isn’t “I ask for education or manipulation and the AI gives it to me, and bad stuff happens”, but something like this: I ask my AI to help me survive, and the AI (among other things) converts me to some religion because it thinks belonging to a church will give me a support group and help maximize my chances, and the Overseer thinks religious education is just education rather than manipulation, or mistakenly thinks I think that, or makes some other mistake that fails to prevent this.
I see. What I was trying to do was answer your terminology question by addressing simple extreme cases. E.g., if you ask an AI to disconnect its shutdown button, I don’t think it’s being incorrigible. If you ask an AI to keep you safe, and then it disconnects its shutdown button, it is being incorrigible.
I think the main way the religion case differs is that the AI system is interfering with our intellectual ability to strategize about AI rather than with our physical systems for redirecting AI, and I’m not sure how that counts. But if I ask an AI to keep me safe and it mind-controls me into wanting to propagate that AI, that’s surely incorrigible. Maybe, as you suggest, it’s just fundamentally ill-defined...