I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)
I would correct it to “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you.”
I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”
This part is tricky for me to interpret.
On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can’t directly see what the consequences of your actions will be. So, I’m specifically pointing at an aspect of “optimizing it is hard” which is about having a good prior. You could say that “utility” is the true target, and “expected utility” is the proxy which you have to use in decision theory.
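To make the “expected utility as proxy” point concrete, here is a toy sketch (my own illustration with made-up names and numbers, not anything from the post): an agent handed the true utility function but equipped with an overconfident prior still picks an action that is terrible in the actual world.

```python
# Toy sketch: true utility function + wrong prior => bad choice.
# All outcomes, actions, and numbers are illustrative assumptions.

true_utility = {            # U(outcome), assumed known exactly
    "safe_payoff": 1.0,
    "big_payoff": 10.0,
    "disaster": -100.0,
}

# Outcome distribution for each action, in the actual world vs. under the agent's prior.
true_dynamics = {
    "cautious": {"safe_payoff": 1.0},
    "risky":    {"big_payoff": 0.2, "disaster": 0.8},
}
agent_prior = {
    "cautious": {"safe_payoff": 1.0},
    "risky":    {"big_payoff": 0.99, "disaster": 0.01},  # overconfident about "risky"
}

def expected_utility(action, dynamics):
    return sum(p * true_utility[outcome] for outcome, p in dynamics[action].items())

chosen = max(agent_prior, key=lambda a: expected_utility(a, agent_prior))
print("agent picks:", chosen)                                              # risky
print("EU under its prior:", expected_utility(chosen, agent_prior))        # 8.9
print("EU in the actual world:", expected_utility(chosen, true_dynamics))  # -78.0
```

The utility function is exactly right here; what goes wrong is the proxy, “expected utility under the agent’s prior”.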
On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn’t make sense to think about “the true human utility function” in isolation. None of my choice behavior (be it revealed preferences, verbally claimed preferences, etc.) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I’m not even sure it is sensible to think of U_H alone as capturing human preferences; maybe U_H doesn’t really make sense apart from P_H.
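As a toy illustration of this (my own numbers, and a simplified special case rather than the full Jeffrey-Bolker rotation): take one world, scale its probability up and its utilities down by the same factor, then renormalize the probabilities. Every action’s expected utility gets divided by the same constant, so the two agents rank all actions identically and no choice behavior can tell them apart.

```python
# Two (probability, utility) pairs that induce the same choice behavior.
# All worlds, actions, and numbers are made up for illustration.

worlds = ["w_rare", "w_common"]
actions = ["a1", "a2"]

# Agent 1: thinks w_rare is unlikely but cares a lot about what happens there.
P1 = {"w_rare": 0.01, "w_common": 0.99}
U1 = {("a1", "w_rare"): 500.0, ("a2", "w_rare"): 0.0,
      ("a1", "w_common"): 1.0,  ("a2", "w_common"): 4.0}

# Agent 2: w_rare's probability scaled up by k, its utilities scaled down by k,
# probabilities renormalized. Renormalizing divides every expected utility by
# the same constant, so the ordering over actions is unchanged.
k = 10.0
P2_raw = {"w_rare": P1["w_rare"] * k, "w_common": P1["w_common"]}
Z = sum(P2_raw.values())
P2 = {w: p / Z for w, p in P2_raw.items()}
U2 = {(a, w): (u / k if w == "w_rare" else u) for (a, w), u in U1.items()}

def ranking(P, U):
    eu = {a: sum(P[w] * U[(a, w)] for w in worlds) for a in actions}
    return sorted(actions, key=eu.get, reverse=True), eu

print(ranking(P1, U1))  # a1 preferred to a2
print(ranking(P2, U2))  # same ordering, different "probabilities" and "utilities"
```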
So, to summarize,
1. I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I’m not sure it makes sense to say there’s a “true human utility function”—the VNM theorem only gets a U_H which is unique up to positive affine transformation by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which derives a rational agent’s probability and utility functions together from a single set of preference axioms rather than justifying the two independently, shows that we can do this “rotation” which shifts which part of the preferences is represented in the probability vs. in the utility, without changing the underlying preferences.
3. If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn’t seem practically realizable.
However, if we could magically give the AI the “true universe” prior with the “true utility function”, I would be happy and say we were done, even if it wasn’t justifiable and couldn’t explain it to humans. I agree it would not be aligned in the sense of the post.
So, I’m not even sure it is sensible to think of U_H alone as capturing human preferences; maybe U_H doesn’t really make sense apart from P_H.
This seems to argue that if my AI knew the winning lottery numbers, but didn’t have a chance to tell me how it knows this, then it shouldn’t buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but it seems like there really are some clear differences between the two in the real world, and we should account for them somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don’t. (Obviously humans don’t have idealized utility functions, but this is all a theoretical exercise anyway.)
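To spell out the lottery worry (a rough sketch with made-up numbers, not anything from the post): if the AI scores actions by expected utility under my beliefs rather than its own, then a ticket it is certain will win still looks like a losing bet.

```python
# Lottery sketch: same utility of money, different beliefs about winning.
# All numbers are illustrative assumptions.

TICKET_PRICE = 2.0
PRIZE = 1_000_000.0
N_TICKETS = 10_000_000

p_human_win = 1.0 / N_TICKETS  # human: this ticket is almost surely a loser
p_ai_win = 1.0                 # AI: it has somehow seen the winning number

def eu_buy(p_win):
    # Expected gain from buying the ticket, relative to not buying.
    return p_win * (PRIZE - TICKET_PRICE) + (1 - p_win) * (-TICKET_PRICE)

def choice(p_win):
    return "buy" if eu_buy(p_win) > 0 else "don't buy"

print("using the AI's beliefs:   ", choice(p_ai_win))     # buy
print("using the human's beliefs:", choice(p_human_win))  # don't buy
```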
I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
Thanks for clarifying, that’s clearer to me now.
If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
I generally agree with the objective you propose (for practical reasons). The obvious way to do this is imitation learning, where (to a first approximation) you just copy the human’s policy. (Or alternatively, have the policy that a human would approve of you having.) This won’t let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how does it do better? My normal answer to this in the EV framework is “it has better estimates of the probabilities of future states”, but that answer is no longer available here. Perhaps you’re hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn’t have before the explanation. In that case, the human’s probutilities have changed, which means that policy alignment is now “alignment to a thing that I can manipulate”, which seems bad.
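For concreteness, here is a minimal sketch of the imitation-learning idea (my own toy example; the states, actions, and defer-to-human fallback are all assumptions, not anything from the post): to a first approximation the learned policy just replays the human’s demonstrated behavior, and on anything outside the demonstrations it has nothing better to fall back on.

```python
# Toy imitation learning: per-state majority vote over human demonstrations.
from collections import Counter, defaultdict

# (state, action) pairs demonstrated by the human -- made-up data.
demonstrations = [
    ("low_battery", "recharge"),
    ("low_battery", "recharge"),
    ("task_pending", "work"),
    ("task_pending", "work"),
    ("task_pending", "ask_human"),
]

counts = defaultdict(Counter)
for state, action in demonstrations:
    counts[state][action] += 1

def imitation_policy(state):
    if state not in counts:
        return "ask_human"  # no demonstration to copy: defer rather than guess
    return counts[state].most_common(1)[0][0]

print(imitation_policy("low_battery"))   # recharge
print(imitation_policy("task_pending"))  # work
print(imitation_policy("novel_state"))   # ask_human -- never exceeds the demonstrator
```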
Fwiw I am generally in favor of approaches along the lines of policy alignment, I’m more confused about the theory behind it here.