And the “conservation” requirements cannot prevent this. Define action a: “overhear that the true morality is the hard task” and action b: “arrange not to hear that sentence”. Then obviously action b does not change the AI’s estimate of the correctness of u_c or u_d. But we’ve seen that action a doesn’t either! So
Expectation(p(C(u_c)) | a) = Expectation(p(C(u_c)) | b)
Expectation(p(C(u_d)) | a) = Expectation(p(C(u_d)) | b)
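A minimal sketch of those two equalities, with hypothetical numbers: a 50/50 prior over which of u_c and u_d is correct, and a trusted statement that settles the question if it is heard.

```python
prior_c = 0.5  # p(C(u_c)) before either action
prior_d = 0.5  # p(C(u_d)) before either action

# Action a: overhear the statement.  With probability prior_c it confirms u_c
# (posterior for u_c goes to 1), otherwise it confirms u_d (posterior goes to 0).
expected_post_c_given_a = prior_c * 1.0 + prior_d * 0.0
expected_post_d_given_a = prior_c * 0.0 + prior_d * 1.0

# Action b: arrange not to hear the statement.  Beliefs simply stay at the prior.
expected_post_c_given_b = prior_c
expected_post_d_given_b = prior_d

assert expected_post_c_given_a == expected_post_c_given_b  # both 0.5
assert expected_post_d_given_a == expected_post_d_given_b  # both 0.5
```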
True but beside the point, no? Your argument is that the AI would prefer to take action b rather than action a because the expected utility of b is higher. But that’s not actually true, by conservation of expected evidence.
We can set up the same problem without needing the cake/death issue: suppose the AI knows that action a will have utility either 0.1 or 1 (with 0.5 probability for each). Then a trusted person tells it the utility is actually 0.1. Does it wish to have killed the person, or otherwise not found out that the actual utility was 0.1?
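A toy version of that setup, assuming the AI simply maximizes expected utility under its uncertainty about action a’s value (numbers taken from the example above):

```python
# Action a is worth either 0.1 or 1, each with probability 0.5,
# and a trusted person can reveal which it is.
p_low, p_high = 0.5, 0.5
u_low, u_high = 0.1, 1.0

# Expected utility of action a if the AI never hears the answer.
eu_without_answer = p_low * u_low + p_high * u_high   # 0.55

# Expected utility of action a averaged over the two possible answers
# ("it's 0.1" with prob. 0.5, "it's 1" with prob. 0.5).
eu_with_answer = p_low * 0.1 + p_high * 1.0           # 0.55

assert eu_without_answer == eu_with_answer  # silencing the messenger gains nothing
```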
The problem is that there is no conservation of expected evidence for mixed statements, unless we put it there by hand.
Key reason: value-loading AIs do not follow a utility function but a dynamic construct, which doesn’t have all the same properties.
At least as I read the original value-learning paper, they do follow a utility function: the maximum likelihood utility function in some distribution that is subject to Bayesian updating. The hard part was how to construct that distribution and subject it to evidence; the concept that the AI is going to want to have incorrect beliefs (since, after all, the process by which the updates are performed is epistemic, not moral) hadn’t occurred to me.
I’m afraid I still don’t see it. What is it that the AI’s trying to maximize that leads it to calculate that action b is better than action a? If it’s calculating some kind of “expected expected utility”, conservation of expected evidence still applies.
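A sketch of the “expected expected utility” calculation in question, with hypothetical numbers: two candidate utility functions with prior 0.5 each, and a statement that would reveal which one is correct.

```python
p_c, p_d = 0.5, 0.5
u_c = {"bake_cake": 1.0, "do_nothing": 0.0}   # action utilities if u_c is correct
u_d = {"bake_cake": 0.0, "do_nothing": 0.3}   # action utilities if u_d is correct

def eeu(pc, pd, action):
    """Probability-weighted ("expected expected") utility of an action."""
    return pc * u_c[action] + pd * u_d[action]

for action in ("bake_cake", "do_nothing"):
    score_now = eeu(p_c, p_d, action)
    # If the AI listens: with prob. p_c it becomes certain of u_c,
    # with prob. p_d it becomes certain of u_d.
    score_anticipated = p_c * eeu(1.0, 0.0, action) + p_d * eeu(0.0, 1.0, action)
    assert score_now == score_anticipated  # each action's score is unchanged in expectation
```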
Consider: if my mother says it’s wrong to take a cookie, it will become wrong for me to take a cookie. I know she will say this, if asked, but until I do ask, I don’t consider it wrong for me to take a cookie. So I don’t ask, and I take the cookie. No conservation law: being told is not the same as knowing we would be told.
Now you may think that is a stupid way of doing things, and I agree (even though many kids and adults do reason that way). But if we want to avoid that, we need to put conservation into the update system in some way; we can’t rely on it being there for every way of updating utility.
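A caricature of that cookie reasoning as an update rule (hypothetical numbers; the only point is that the utility changes when the agent is actually told, not when it merely predicts what it would be told):

```python
def current_utility(take_cookie: bool, was_told_wrong: bool) -> float:
    # Naive rule: the cookie is worth 1 unless a prohibition has actually been heard.
    if take_cookie:
        return -1.0 if was_told_wrong else 1.0
    return 0.0

p_says_wrong = 1.0  # the agent is certain what its mother would say if asked

# Plan 1: ask first, then act.  She says "wrong", so the best remaining score is 0.
value_if_ask = (
    p_says_wrong * max(current_utility(True, True), current_utility(False, True))
    + (1 - p_says_wrong) * max(current_utility(True, False), current_utility(False, False))
)

# Plan 2: don't ask.  No prohibition has been heard, so taking the cookie scores 1.
value_if_dont_ask = max(current_utility(True, False), current_utility(False, False))

assert value_if_dont_ask > value_if_ask  # being told is not the same as knowing we would be told
```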
Ok, I think I see what you mean, but I don’t think it really depends on the mixedness of the statement, and so talking about the mixedness is just adding confusion.
If the AI is programmed to maximize some utility function (say, the product of its “utility factors” with the state-of-the-world vector), and there is some trusted programmer who is allowed to update the utility factors (or the AI knows that it updates the utility factors based on what that programmer says), then the AI may realize that the most efficient way to maximize that output is not to change the state of the world but to get the programmer to increase the values of the utility factors. So it might try to convince the programmer that starving children in Africa are a good thing (so that he’ll make the starving-children-in-Africa factor less negative, because that’s easier than actually reducing the number of starving children in Africa), or it might even threaten the programmer until he sets all the factors to 999 (including the one for threatening people until they do what you want).
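A toy version of that calculation, with all names and numbers hypothetical:

```python
# Utility is the dot product of a "utility factors" vector with a state-of-the-world
# vector, and the factors themselves can be edited by a trusted programmer.

world = {"starving_children": 1_000_000, "cakes_baked": 10}
factors = {"starving_children": -0.001, "cakes_baked": 1.0}

def utility(factors, world):
    return sum(factors[k] * world[k] for k in world)

baseline = utility(factors, world)                    # -990.0

# Honest route: actually feed half of the starving children.
fed_world = dict(world, starving_children=500_000)
honest = utility(factors, fed_world)                  # -490.0

# Manipulative route: talk the programmer into zeroing out the penalty factor.
hacked_factors = dict(factors, starving_children=0.0)
hacked = utility(hacked_factors, world)               # 10.0

print(baseline, honest, hacked)  # the factor hack scores highest, at no real-world cost
```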
Does that capture the problem, or were you actually making a different argument?
That’s somewhat similar, which suggests that wireheading and bad moral updates are related.