Rohin Shah comments on Policy Alignment

Rohin Shah 19 Jul 2018 21:27 UTC
It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn’t seem practically realizable.
However, if we could magically give the AI the “true universe” prior with the “true utility function”, I would be happy and say we were done, even if it wasn’t justifiable and couldn’t explain it to humans. I agree it would not be aligned in the sense of the post.
So, I’m not even sure it is sensible to think of UH alone as capturing human preferences; maybe UH doesn’t really make sense apart from PH.
This seems to argue that if my AI knew the winning lottery numbers, but didn’t have a chance to tell me how it knows this, then it shouldn’t buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but there do seem to be some very clear differences between the two in the real world, and we should account for them somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don’t. (Obviously humans don’t have idealized utility functions, but this is all a theoretical exercise anyway.)
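As a rough illustration of that last difference, here is a minimal sketch of the lottery case in which the utility over outcomes stays fixed while the probability of winning is updated by Bayes' rule. Everything in it is hypothetical and chosen only for illustration: the names (`UTILITY`, `posterior`, `best_action`), the ticket price, the jackpot, and the likelihoods.

```python
# Hypothetical toy model: utilities stay fixed, beliefs update on evidence.

TICKET_PRICE = 2.0
JACKPOT = 1_000_000.0

# Fixed utility over (action, outcome) pairs; nothing below modifies it.
UTILITY = {
    ("buy", "win"): JACKPOT - TICKET_PRICE,
    ("buy", "lose"): -TICKET_PRICE,
    ("skip", "win"): 0.0,
    ("skip", "lose"): 0.0,
}


def posterior(prior_win, likelihood_if_win, likelihood_if_lose):
    """Bayes' rule for P(win | evidence)."""
    joint_win = prior_win * likelihood_if_win
    joint_lose = (1.0 - prior_win) * likelihood_if_lose
    return joint_win / (joint_win + joint_lose)


def expected_utility(action, p_win):
    """Expected utility of an action given a probability of winning."""
    return (p_win * UTILITY[(action, "win")]
            + (1.0 - p_win) * UTILITY[(action, "lose")])


def best_action(p_win):
    """Action with the highest expected utility under the given belief."""
    return max(("buy", "skip"), key=lambda a: expected_utility(a, p_win))


prior = 1e-7  # assumed belief before seeing any evidence
# Assumed evidence that is far more likely if this ticket is the winner.
updated = posterior(prior, likelihood_if_win=0.99, likelihood_if_lose=1e-9)

print(best_action(prior))    # skip
print(best_action(updated))  # buy
```

The preferred action flips from skipping to buying even though `UTILITY` is never touched; only the belief moved.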
I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
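To make the quoted point concrete, here is a toy sketch, assuming the utility function is known exactly and only the beliefs vary. The agent picks the action with the highest expected utility under its beliefs, and we then look at the utility that action actually yields in the true state. The states, actions, payoffs, and belief numbers are all made up for illustration.

```python
# Hypothetical toy example: expected utility under beliefs vs. realized utility.

# Utility of each action in each possible state (assumed known exactly).
UTILITY = {
    "umbrella": {"rain": 1.0, "sun": 0.5},
    "no_umbrella": {"rain": -1.0, "sun": 1.0},
}


def expected_utility(action, belief):
    """Expected utility of an action under a belief (state -> probability)."""
    return sum(p * UTILITY[action][state] for state, p in belief.items())


def choose(belief):
    """Action that maximizes expected utility under the given belief."""
    return max(UTILITY, key=lambda a: expected_utility(a, belief))


true_state = "rain"
good_belief = {"rain": 0.8, "sun": 0.2}  # roughly matches the true state
bad_belief = {"rain": 0.1, "sun": 0.9}   # badly mistaken about the true state

for name, belief in [("good", good_belief), ("bad", bad_belief)]:
    action = choose(belief)
    proxy = expected_utility(action, belief)   # what the agent optimizes
    realized = UTILITY[action][true_state]     # what actually happens
    print(name, action, proxy, realized)
```

With the mistaken belief the agent still maximizes its proxy, but the chosen action does worst in the true state, even though the utility function itself was never wrong.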
Thanks for clarifying, that’s clearer to me now.
If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
I generally agree with the objective you propose (for practical reasons). The obvious way to do this is to do imitation learning, where (to a first approximation) you just copy the human’s policy. (Or alternatively, have the policy that a human would approve of you having.) This won’t let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how is it doing better? My normal answer to this in the EV framework is “it has better estimates of probabilities of future states”, but we can’t do that any more. Perhaps you’re hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn’t have before the explanation. In that case, the human’s probutilities have changed, which means that policy alignment is now “alignment to a thing that I can manipulate”, which seems bad.
Fwiw, I am generally in favor of approaches along the lines of policy alignment; I’m just more confused about the theory behind it here.