Okay, I think I misunderstood what you were claiming in this post. Based on the following line:
I claimed that an agent which learns your utility function (pretending for a moment that “your utility function” really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you.
I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)
I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”
Examples:
You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn’t make any sense from our perspective to make an AI reason in that way: we don’t want the AI to act in ways which we have good reason to believe are unreliable.
Yeah, an AI that optimizes the true utility function probably won’t be corrigible. From a theoretical standpoint, that seems fine—corrigibility seems like an easier target to shoot for, not a necessary aspect of an aligned AI. The reason we don’t want the scenario above is “we have good reason to believe [the AI is] unreliable”, which sounds like the AI is failing to optimize the utility function correctly.
If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human’s expectation: taking half, or taking it all. It’s quite possible in this case that it takes all the money.
This also sounds like the value-learning agent is simply bad at correctly optimizing the true utility function. (It seems to me that all of decision theory is about how to properly optimize a utility function in theory.)
We can go in the opposite extreme, and make PR a broad prior such as the Solomonoff distribution, with no information about our world in particular.
I believe the observation has been made before that running UDT on such a prior could have weird results.
Again, it seems like this proposal for making an aligned AI is just bad at optimizing the true utility function.
So I guess the way I would summarize this post:
Value learning is hard.
Even if you know the correct utility function, optimizing it is hard.
Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.
Is this right?
I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)
I would correct it to “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you.”
I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”
This part is tricky for me to interpret.
On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can’t directly see what the consequences of your actions will be. So, I’m specifically pointing at an aspect of “optimizing it is hard” which is about having a good prior. You could say that “utility” is the true target, and “expected utility” is the proxy which you have to use in decision theory.
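To make the target-versus-proxy point concrete, here is a minimal sketch in Python using a lottery toy example. Everything in it (the payoff numbers, the two priors, and the names UTILITY, expected_utility, and best_action) is made up for illustration and is not taken from the post. It only shows that an agent which correctly maximizes expected utility under a poor prior can still end up with less actual utility.

```python
from fractions import Fraction as F

# Hypothetical payoffs: utility of each (action, state of the world) pair.
UTILITY = {
    ("buy", "wins"): 1000, ("buy", "loses"): -1,
    ("keep", "wins"): 0,   ("keep", "loses"): 0,
}
ACTIONS = ("buy", "keep")

def expected_utility(action, prior):
    """The proxy: utility averaged over the agent's beliefs about the state."""
    return sum(p * UTILITY[(action, state)] for state, p in prior.items())

def best_action(prior):
    """Pick the action with the highest expected utility under this prior."""
    return max(ACTIONS, key=lambda action: expected_utility(action, prior))

# Two hypothetical priors over the same two states.
uninformed = {"wins": F(1, 10**6), "loses": 1 - F(1, 10**6)}
informed = {"wins": F(1), "loses": F(0)}

# Assume the ticket really does win; "utility" is what the action yields there.
true_state = "wins"
realized_uninformed = UTILITY[(best_action(uninformed), true_state)]  # keeps the money
realized_informed = UTILITY[(best_action(informed), true_state)]      # buys the ticket
```

Both agents share the same utility table and both optimize perfectly; the only difference in what they actually get comes from the prior.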
On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn’t make sense to think about “the true human utility function” in isolation. None of my choice behavior (be it revealed preferences or verbally claimed preferences etc) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I’m not even sure it is sensible to think of UH alone as capturing human preferences; maybe UH doesn’t really make sense apart from PH.
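A toy sketch of this in Python, assuming the usual statement of the rotation as I understand it: each world's probability is multiplied by c*U + d, and its utility is mapped to (a*U + b)/(c*U + d), with a*d - b*c > 0 and c*U + d positive everywhere. The three worlds, their numbers, and all function names are hypothetical. The check compares how the two representations rank every pair of events by conditional expected utility.

```python
from fractions import Fraction as F
from itertools import combinations

def events(worlds):
    """All non-empty sets of worlds."""
    names = sorted(worlds)
    return [frozenset(combo) for r in range(1, len(names) + 1)
            for combo in combinations(names, r)]

def desirability(prob, util, event):
    """Conditional expected utility of an event (assumes positive probabilities)."""
    mass = sum(prob[w] for w in event)
    return sum(prob[w] * util[w] for w in event) / mass

def rotate(prob, util, a, b, c, d):
    """Shift preference information between the probability and the utility."""
    scale = {w: c * util[w] + d for w in prob}
    assert a * d - b * c > 0 and all(s > 0 for s in scale.values())
    new_prob = {w: prob[w] * scale[w] for w in prob}
    assert sum(new_prob.values()) == 1  # holds when c * E[U] + d == 1
    new_util = {w: (a * util[w] + b) / scale[w] for w in prob}
    return new_prob, new_util

def same_ordering(p1, u1, p2, u2):
    """True if both representations rank every pair of events the same way."""
    def sign(x):
        return (x > 0) - (x < 0)
    evs = events(p1)
    return all(
        sign(desirability(p1, u1, x) - desirability(p1, u1, y))
        == sign(desirability(p2, u2, x) - desirability(p2, u2, y))
        for x in evs for y in evs
    )

P = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}
U = {"a": F(1), "b": F(0), "c": F(-2)}  # E[U] = 0, so d = 1 keeps P2 normalized
P2, U2 = rotate(P, U, a=1, b=0, c=F(1, 4), d=1)
# World "c" goes from probability 1/4 and utility -2 to probability 1/8 and utility -4.
indistinguishable = same_ordering(P, U, P2, U2)
```

In this example world "c" becomes half as likely but twice as bad, and no comparison between events changes, which is the sense in which choice behavior cannot tell the two representations apart.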
So, to summarize,
1. I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I’m not sure it makes sense to say there’s a “true human utility function”: the VNM theorem only gets a UH which is unique up to positive affine transformation by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this “rotation”, which shifts which parts of the preferences are represented in the probability vs. in the utility, without changing the underlying preferences.
3. If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn’t seem practically realizable.
However, if we could magically give the AI the “true universe” prior with the “true utility function”, I would be happy and say we were done, even if it wasn’t justifiable and couldn’t explain it to humans. I agree it would not be aligned in the sense of the post.
So, I’m not even sure it is sensible to think of UH alone as capturing human preferences; maybe UH doesn’t really make sense apart from PH.
This seems to argue that if my AI knew the winning lottery numbers, but didn’t have a chance to tell me how it knows this, then it shouldn’t buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but there really do seem to be some very clear differences between the two in the real world, and we should account for them somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don’t. (Obviously humans don’t have idealized utility functions, but this is all a theoretical exercise anyway.)
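As a rough illustration of that last difference, here is a small Python sketch in which new evidence changes the beliefs through a Bayes update while the utility table is never touched. The payoffs, the prior, the likelihoods, and the function names are all invented for the example.

```python
from fractions import Fraction as F

# Hypothetical utility table; nothing below ever modifies it.
UTILITY = {
    ("buy", "wins"): 1000, ("buy", "loses"): -1,
    ("keep", "wins"): 0,   ("keep", "loses"): 0,
}
ACTIONS = ("buy", "keep")

def best_action(beliefs):
    """Maximize expected utility under the given beliefs."""
    def expected_utility(action):
        return sum(p * UTILITY[(action, s)] for s, p in beliefs.items())
    return max(ACTIONS, key=expected_utility)

def bayes_update(prior, likelihood):
    """Posterior over states given P(evidence | state) for each state."""
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

prior = {"wins": F(1, 10**6), "loses": 1 - F(1, 10**6)}
# Assumed evidence that is almost impossible unless the ticket is a winner.
likelihood = {"wins": F(1), "loses": F(1, 10**9)}
posterior = bayes_update(prior, likelihood)

before = best_action(prior)      # "keep"
after = best_action(posterior)   # "buy"
```

The decision flips purely because the probabilities moved; the utilities are the same object before and after the evidence arrives.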
I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
Thanks for clarifying; that’s clearer to me now.
If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
I generally agree with the objective you propose (for practical reasons). The obvious way to do this is imitation learning, where (to a first approximation) you just copy the human’s policy. (Or alternatively, have the policy that a human would approve of you having.) This won’t let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how is it doing better? My normal answer to this in the EV framework is “it has better estimates of probabilities of future states”, but we can’t do that anymore. Perhaps you’re hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn’t have before the explanation. In that case, the human’s probutilities have changed, which means that policy alignment is now “alignment to a thing that I can manipulate”, which seems bad.
FWIW, I am generally in favor of approaches along the lines of policy alignment; I’m just more confused about the theory behind it here.
I’m not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn’t. I see that I need to do some careful disambiguation of background assumptions and language.
Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.
This part, at least, is getting at the same intuition I’m coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies. (I am thinking I’ll write another post to make that connection clearer.)
I will have to think harder about the difference between how you’re framing things and how I would frame things, to try to clarify more.
I’m not even sure whether you are closer or further from understanding what I meant, now.
:(
I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies.
My assumption was that you were arguing for why learning policies directly (assuming we could do it) has advantages over the default approach of value learning + optimization. That framing seems to explain most of the post.