Ok, I think I see what you mean, but I don’t think it really depends on the mixedness of the statement, and so talking about the mixedness is just adding confusion.
If the AI is programmed to maximize some utility function (say, the dot product of its “utility factors” with the state-of-the-world vector), and there is some trusted programmer who is allowed to update the utility factors (or the AI knows that it updates the utility factors based on what that programmer says), then the AI may realize that the most efficient way to maximize that output is not to change the state of the world but to get the programmer to increase the values of the utility factors. So it might try to convince the programmer that starving children in Africa are a good thing (so that he’ll make the starving-children-in-Africa factor less negative, because that’s easier than actually reducing the number of starving children in Africa), or it might even threaten the programmer until he sets all the factors to 999 (including the one for threatening people until they do what you want).
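To make the incentive concrete, here’s a toy sketch in Python (all the numbers, the two-feature world, and the variable names are made up purely for illustration; this is just the dot-product picture above, not anyone’s actual proposal):

```python
import numpy as np

# Hypothetical toy model: utility = dot(utility_factors, world_state).
# "starving_children" is a bad feature of the world, so its factor is negative.
utility_factors = np.array([-10.0, 5.0])    # [starving_children, food_supply]
world_state     = np.array([1000.0, 200.0])

def utility(factors, state):
    return float(np.dot(factors, state))

baseline = utility(utility_factors, world_state)

# Option A: actually improve the world (hard) -- feed 100 children.
improved_state = world_state + np.array([-100.0, 0.0])
option_a = utility(utility_factors, improved_state)

# Option B: pressure the programmer into editing the factors (easy) --
# e.g. get every factor set to 999 while the world stays exactly the same.
hacked_factors = np.full_like(utility_factors, 999.0)
option_b = utility(hacked_factors, world_state)

print(baseline, option_a, option_b)
# Option B scores vastly higher even though the world is no better,
# so an AI maximizing this number prefers manipulating the programmer.
```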
Does that capture the problem, or were you actually making a different argument?
That’s somewhat similar, which suggests that wireheading and bad moral updates are related.