Key reason: value-loading AIs do not follow a utility function but a dynamic construct that doesn’t have all the same properties.
At least as I read the original value-learning paper, they do follow a utility function: the maximum likelihood utility function in some distribution that is subject to Bayesian updating. The hard part was how to construct that distribution and subject it to evidence; the concept that the AI is going to want to have incorrect beliefs (since, after all, the process by which the updates are performed is epistemic, not moral) hadn’t occurred to me.
I’m afraid I still don’t see it. What is it that the AI’s trying to maximize that leads it to calculate that action b is better than action a? If it’s calculating some kind of “expected expected utility”, conservation of expected evidence still applies.
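A sketch of that claim, under the assumption (which is how I read the setup) that the agent acts to maximize expected utility under a Bayesian distribution $p$ over candidate utility functions: it ranks actions by

$$\hat U(a) = \mathbb{E}_{U\sim p}\,[U(a)],$$

and for any question it could ask, with possible answers $e$ incorporated by Bayes’ theorem,

$$\mathbb{E}_{e}\big[\,\mathbb{E}_{U\sim p(\cdot\mid e)}[U(a)]\,\big] = \mathbb{E}_{U\sim p}\,[U(a)].$$

Asking can’t shift any action’s expected value in advance, so the agent has no incentive to avoid asking.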
Consider: if my mother says it’s wrong to take a cookie, it will become wrong for me to take a cookie. I know she will say this, if asked, but until I do ask, I don’t consider it wrong for me to take a cookie. So I don’t ask, and I take the cookie. No conservation law: being told is not the same as knowing we would be told.
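Here is a toy version of that reasoning, with made-up numbers and a deliberately naive update rule (the utility for taking the cookie is rewritten only when my mother actually speaks, not when I merely predict what she would say):

```python
# Toy model of the cookie example; all numbers are made up for illustration.
# The agent's utility for "take the cookie" is rewritten only when mother has
# actually spoken; predicting her answer changes nothing.

def utility(take_cookie: bool, mother_has_spoken: bool) -> float:
    if not take_cookie:
        return 0.0
    # Before being told, taking the cookie looks good; after, it looks bad.
    return 1.0 if not mother_has_spoken else -1.0

# Option A: ask first, then act. The agent already predicts (with certainty)
# that mother will say "wrong", so the best it can do after asking is 0.
ask_then_act = max(utility(True, mother_has_spoken=True),
                   utility(False, mother_has_spoken=True))   # 0.0

# Option B: don't ask, just take the cookie.
act_without_asking = utility(True, mother_has_spoken=False)  # 1.0

print(ask_then_act, act_without_asking)  # the agent prefers not to ask
```

If the agent were instead updating a Bayesian belief about one fixed utility function, knowing with certainty what the answer would be would already have the same effect as hearing it; with a rewrite rule like this one it doesn’t, and that is the missing conservation.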
Now you may think that is a stupid way of doing things, and I agree (even though many kids and adults do reason that way). But if we want to avoid that, we need to build conservation into the update system in some way; we can’t rely on it being there for every way of updating utility.
OK, I think I see what you mean, but I don’t think it really depends on the mixedness of the statement, so talking about the mixedness just adds confusion.
If the AI is programmed to maximize some utility function, say the product of its “utility factors” with the state-of-the-world vector, and there is some trusted programmer who is allowed to update the utility factors (or the AI knows that it updates the utility factors based on what that programmer says), then the AI may realize that the most efficient way to maximize that output is not to change the state of the world but to get the programmer to increase the values of the utility factors. So it might try to convince the programmer that starving children in Africa are a good thing (so that he’ll make the starving-children-in-Africa factor less negative, which is easier than actually reducing the number of starving children in Africa), or it might even threaten the programmer until he sets all the factors to 999 (including the one for threatening people until they do what you want).
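A minimal sketch of that incentive, reading “the product” as a dot product and using made-up numbers (the factor names and the 999 come from the example above; the persuasion or threat is abstracted into directly changing the factors):

```python
import numpy as np

# Utility = dot(utility_factors, world_state); the programmer may change the factors.
utility_factors = np.array([-5.0, 2.0])   # [starving children, fed children]
world_state     = np.array([10.0, 3.0])   # counts, in millions (made up)

def utility(factors, state):
    return float(factors @ state)

baseline = utility(utility_factors, world_state)            # -44.0

# Option 1: actually improve the world (one million children fed).
improved_state = world_state + np.array([-1.0, 1.0])
fix_the_world = utility(utility_factors, improved_state)    # -37.0

# Option 2: get the programmer to raise the factors instead.
hacked_factors = np.array([999.0, 999.0])
hack_the_factors = utility(hacked_factors, world_state)     # 12987.0

print(baseline, fix_the_world, hack_the_factors)
```

The comparison only comes out this way if the AI scores each option by the factors it would end up with; an agent with a genuinely fixed utility function would score the hacked factors using its current ones and see no gain, which seems to be the gap between a dynamic construct and a fixed utility function.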
Does that capture the problem, or were you actually making a different argument?
The problem is that there is no conservation of expected evidence for mixed statements, unless we put it there by hand.
That’s somewhat similar, which suggests that wireheading and bad moral updates are related.